DEPARTMENT OF STATISTICS 
UNIVERSITY OF LONDON, UNIVERSITY COLLEGE 


STATISTICAL RESEARCH 
MEMOIRS 


Edited by 
J. NEYMAN and E.S. PEARSON 


VOLUME I


LONDON 


June 1936 




ISSUED BY THE DEPARTMENT OF STATISTICS 
UNIVERSITY OF LONDON, UNIVERSITY COLLEGE 


AND PRINTED AT 
THE UNIVERSITY PRESS, CAMBRIDGE 


PRINTED IN GREAT BRITAIN 


To 
the memory of our Professor 
KARL PEARSON 
27 March 1857—27 April 1936 
Founder of mathematical statistics, originator of
its various applications, teacher and inspirer of
innumerable research workers of many nations
and races.


[Some account of the life and work of Karl Pearson 
will appear in a coming issue of Biometrika] 




FOREWORD 


The present series of Statistical Research Memoirs are being published in
continuation of a tradition established in 1904 with the issue of the first Drapers’
Company Research Memoir. For a period of 30 years a succession of publications,
of semi-periodical character, edited by Karl Pearson, were issued from University 
College, London, first from the Department of Applied Mathematics and since 
1911 from the Department of Applied Statistics which incorporated the Biometric 
and Eugenics Laboratories. The range of subjects included in these publications 
was exceedingly broad; the theory of statistics, various tables to facilitate 
statistical work, research memoirs in the field of biometry, of medicine, of 
anthropology and of eugenics. 

On Karl Pearson’s retirement nearly three years ago, there followed a
reorganization of the Laboratories of which he had been the head, into a Department
of Eugenics and a Department of Statistics. This circumstance, which has
inevitably resulted in a greater specialization in the latter Department, provides
one reason for a difference in character between the present series of memoirs and 
those formerly edited by Karl Pearson. Other factors have, however, contributed 
to this difference. It is widely felt that in spite of the existence of a large number of 
special problems for which perfect solutions exist, statistical theory in general in 
its present state is far from being completely satisfactory from the point of view 
of its accuracy. It is the ambition of the Department to contribute towards the 
establishment of a theory of statistics on a level of accuracy which is usual in 
other branches of mathematics. 

It has seemed to us that efforts in this direction by the workers in the Depart- 
ment of Statistics will be more fruitful if papers on the subject are published 
mainly in one place, rather than spread over a number of journals devoted either 
purely to mathematical subjects or to what may be termed general statistics. In 
this way it will be easier for those interested in the subject to find the work and 
to judge, as a whole, a number of investigations linked by a common purpose and 
often following a common approach. Since statistical papers dealing with applica- 
tions are more useful when they are easily accessible to persons who may apply 
in practice the methods advanced, such papers prepared in the Department, if 
they are not too mathematical, will be published mainly in journals concerned 
with the particular fields of application and only rarely in the present series. 

The Statistical Research Memoirs will contain only papers prepared in the 
Department of Statistics; while the series is not strictly periodical it is hoped that 
a volume of over 150 pages will be issued about once a year. 


J. NEYMAN, Reader in Statistics
E. S. PEARSON, Head of Department of Statistics


UNIVERSITY COLLEGE, LONDON 




CONTRIBUTIONS TO THE THEORY OF TESTING 
STATISTICAL HYPOTHESES 


I. UNBIASSED CRITICAL REGIONS OF TYPE A AND TYPE $A_1$


By J. NEYMAN and E. S. PEARSON


CONTENTS

1. Introductory
2. The choice of a critical region
3. Unbiassed critical regions
4. The determination of an unbiassed critical region of type A
5. Simplification of solution in terms of $\phi = \partial \log p / \partial\theta$
6. Illustrative examples
7. The invariance of an unbiassed region of type A with regard to transformations of the
   unknown parameter
8. The determination of an unbiassed critical region of type $A_1$
9. Summary

APPENDICES

I. The conditions of regularity
II. Calculation of the limits of the type A unbiassed critical region for Ds

1. Introductory. 


In a number of recent publications we have discussed a method of approach to 
the problem of testing statistical hypotheses starting from a simple concept which 
may be expressed as follows: arrange your test so as to minimize the probability 
of errors. Since the errors involved in testing hypotheses are of two kinds, the 
problem requires further specification which may have different forms. One of 
these has led to the theory of uniformly most powerful tests, but as it has been
found that in many situations a solution along these lines is impossible, it follows 
that in such cases the problem of minimizing the probability of errors must be 
specified in a different form. This fresh specification will be discussed in the 
present paper, but we believe that it will be useful first to give a brief review of 
the general method of approach. 

In the first place it must be remembered that, as in all branches of applied 
mathematics, it is necessary to set up a precise but generally simplified model 
which we believe represents the world that we observe with sufficient accuracy 
to provide us with useful results. The methods of statistical analysis may be 
regarded as tools constructed to operate on this model and, as with all tools, their 
parts and uses must be described in technical terminology. Further, since a tool 
is something to be used on many occasions under conditions which are similar 
but not precisely identical, it is important to consider what are the common 
elements of the practical problems to which a statistical test is to be applied. 




In the construction of these tests we must distinguish between those steps 
which can be established by mathematical theory and (on the assumption that 
this theory is correct) are therefore not open to question, and other steps in the 
making of which the statistician has a freedom of choice. In a given set of con- 
ditions mathematical theory may show that definite consequences will follow 
from the application of a statistical test; these consequences may or may not 
appeal to the statistician as desirable. If they do not, he can adjust the form of
test so that the results which can be shown to follow by mathematical reasoning 
are such as he can accept as of value in practical application. There is here in- 
evitably present a personal element, since clearly established consequences which 
appear satisfactory to one individual may not be so regarded by another. The 
controversies that have troubled the world of mathematical statistics seem often 
to have arisen from a failure to distinguish clearly between these subjective and 
objective aspects of the problem. 

Before giving mathematical precision either to the model or the statistical 
tools, it will be useful to illustrate these ideas on a simple type of situation. Two 
examples will be considered. 


Example I. A manufacturer of electric lamps wishes to ascertain, from tests 
carried out on a small number of lamps selected at random, whether a batch of 
1500 lamps is likely to satisfy that condition in a specification which lays down 
upper and lower tolerance limits for the “initial efficiency”,* say $x$, of individual
lamps. 


Example II. A pharmacologist is asked to report whether a new and cheaper 
form of insulin is of inferior quality to an older standard. The experiments are 
carried out on rabbits and consist in the determination of a certain reaction, $x$;
low values of x are associated with inferior quality. 


It is very likely that these two problems would contain common elements as 
follows: 


(1) As a result of the tests a decision must be reached determining which of 
two alternative courses is to be followed. Thus the manufacturer must decide 
whether (A) to place the batch of lamps on the market, or (B) to test a second 
sample. The pharmacologist must decide whether to report, (A) “I can detect no 
indication that the cheaper insulin is worse than the more expensive one and 
therefore recommend the use of the cheaper insulin if certain further tests prove 
satisfactory”; or (B) “The cheaper insulin seems to be of inferior quality and I 
cannot advocate its use.” 

(2) Since the decision must be made on a limited amount of data,† it is
inevitable that it will sometimes differ from that which would have been made were
fuller data available; it may therefore be described as liable to error. Two different
faulty decisions or errors may be made; course (A) may be followed where, with
more complete information, (B) would have been chosen, and vice versa.

* The “initial efficiency” expressed in lumens per watt is a measure of the lamp’s efficiency as
a light producer; it depends largely on the quality of the filament.
† For example, economic considerations will govern the number of lamps which are tested.


(3) Very different consequences will follow from these two errors and therefore 
they should be distinguished. On the one hand, for example, the manufacturer 
may suffer in reputation by allowing a faulty batch of lamps to go on the market, 
and on the other he may incur unnecessary expense in testing a second sample of 
lamps or even in destroying the whole batch when it is in fact up to standard. 

(4) In both problems it is likely that the situation may be represented by the 
same mathematical model. Thus past experience may have shown that the 
distribution of the tested character, x, whether in a single batch of lamps or in 
rabbits from a laboratory population receiving the same treatment, is approxi- 
mately normal having a known standard deviation not dependent upon the mean. 
The model then defines as the probability law for $x$

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\xi}{\sigma}\right)^{2}} \tag{1}$$

and the statistical problem is to draw from $n$ observations $x_1, x_2, \ldots x_n$ inferences
regarding the unknown value of the mean, $\xi$.

In the manufacturer’s problem a specified value, $\xi_0$, represents his objective
mean initial efficiency for this type of lamp, while for the pharmacologist $\xi_0$ is
the mean reaction using the standard form of insulin. Beyond this there is clearly
some difference between the two situations. The manufacturer does not expect
his objective always to be maintained, nor does he mind if the mean level has
shifted to $\xi_1$, provided that, let us say, the values of $\xi_1 - 3\sigma$ and $\xi_1 + 3\sigma$ fall within
the tolerance limits set by the specification for individual lamps. The
pharmacologist, on the other hand, may be almost certain that the new insulin cannot
be better than the old, but is anxious to detect whether it is worse, i.e. to detect
the existence of any positive difference, $\xi_0 - \xi_1$.

Both, however, will be satisfied if the statistical test can provide them with a
rule which will very rarely result in the choice of course (B) when $\xi = \xi_0$, and will
be increasingly likely to suggest course (B) as $\xi$ differs more and more from $\xi_0$.
It is to give precision to the statistical tool used in dealing with situations of this
kind that we have introduced a formal terminology which is convenient if
understood in the sense defined.
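The behaviour asked of the rule can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes the usual test based on the sample mean under model (1) with $\sigma$ known, and the function names and the numerical values in the loop are hypothetical. It returns the chance of choosing course (B) for a two-sided rule (the manufacturer) and for a lower-tail rule (the pharmacologist).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal law, mean 0 and standard deviation 1


def power_two_sided(xi, xi0, sigma, n, alpha):
    """Chance of course (B) when the true mean is xi and the rule rejects
    whenever |xbar - xi0| exceeds its two-sided level-alpha limit."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (xi - xi0) * n ** 0.5 / sigma  # mean of the standardised xbar
    return STD_NORMAL.cdf(-z - shift) + 1 - STD_NORMAL.cdf(z - shift)


def power_lower_tail(xi, xi0, sigma, n, alpha):
    """Chance of course (B) when the rule rejects only for small xbar."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    shift = (xi - xi0) * n ** 0.5 / sigma
    return STD_NORMAL.cdf(-z - shift)


# Hypothetical figures: xi0 = 10, sigma = 2, n = 9 observations, alpha = 0.05.
for xi in (9.0, 9.5, 10.0, 10.5, 11.0):
    print(xi,
          round(power_two_sided(xi, 10.0, 2.0, 9, 0.05), 4),
          round(power_lower_tail(xi, 10.0, 2.0, 9, 0.05), 4))
```

Under these assumptions both rules choose (B) with chance $\alpha$ when $\xi = \xi_0$; the two-sided rule gains power on either side, while the lower-tail rule gains power only for $\xi < \xi_0$.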

Thus in the present instance we are concerned with a test of the statistical
hypothesis, $H_0$, that $\xi = \xi_0$. The class of admissible hypotheses is defined by
equation (1), where it is supposed that $\sigma$ is known. The manufacturer is concerned
with alternative values of $\xi$ both above and below $\xi_0$, while the pharmacologist
wishes to concentrate attention on the range $\xi < \xi_0$. The model which only allows
for uncertainty as to $\xi$ may be inadequate to represent the situation, but this is
another question; the test has been constructed to fit the model and it may lose
its meaning if this is inaccurate or over simplified. The responsibility of choosing
the model rests, however, with the statistician applying the test, who, in this case,
would have to decide for himself whether he is running any serious risk in
assuming the variation in $x$ to be normal and $\sigma$ known.*

If, in following the rule provided by the test (which in this case would probably
relate to the mean $\bar{x}$ of the observations in the sample), course (A) is adopted, this
has been described as the acceptance of the hypothesis, $H_0$, that $\xi = \xi_0$. If course
(B) is adopted it is said that $H_0$ is rejected. If $H_0$ is rejected when, in fact, $\xi = \xi_0$,
it is said that $H_0$ is rejected when it is true and that an error of the first kind has
been made. If $H_0$ is accepted when $\xi \neq \xi_0$, it is said to have been accepted when
some alternative is true, and that an error of the second kind has been made.
In any given case it is impossible to determine whether a hypothesis is true or
false, but it is possible to assess the efficiency of a statistical test by the manner in
which it controls these two sources of error on repeated application in the situation
defined by the mathematical model.

The method of approach that we have followed in a series of previous papers†
has been to consider how these two sources of error may be most effectively
controlled. We shall now give an outline of the results obtained and recall certain
definitions which will be used below.

Denote generally by $x_1, x_2, \ldots x_n$ a system of variables, the values of which can
be given by observation. Any system of values of the $x$’s will be represented by a
point, $E$,‡ in the $n$-dimensioned space, $W$. The point $E$ and the space $W$ will be
called the sample point and the sample space respectively.

If the variables $x_1, \ldots x_n$ have the property that, whatever the region $w$ in the
sample space, there exists a number, $P\{E \in w\}$, representing the probability that
the sample point $E$ will fall within the region $w$, then we shall describe these $x$’s
as random variables. The probability $P\{E \in w\}$ considered as a function of the
region $w$ will be described as the integral probability law of the $x$’s. Any assumption
concerning the nature of the integral probability law $P\{E \in w\}$ is called a statistical
hypothesis. The statistical hypothesis is said to be simple if it determines $P\{E \in w\}$
as a single-valued function of the region $w$. Any statistical hypothesis which is
not simple is called composite.

In our papers quoted we have assumed, and this assumption will be kept
throughout the present paper, that there exists a function, $p(x_1, \ldots x_n) = p(E)$,
defined, non-negative and continuous at almost every point§ of $W$ such that
whatever the region $w$, the value of $P\{E \in w\}$ is equal to the integral of $p(E)$ taken
over that region $w$. The function $p(E)$ thus defined is called the elementary
probability law of the $x$’s. Under the assumption that the elementary probability
law exists, we may say that a statistical hypothesis means any assumption
concerning the nature of the elementary probability law of the $x$’s.

* Investigations into the adequacy of the statistician’s models to represent the phenomena that
he observes have, of course, been made from time to time.

† J. Neyman and E. S. Pearson, (1) “On the problem of the most efficient tests of statistical
hypotheses”, Phil. Trans. Roy. Soc. A, vol. ccxxxi (1933), p. 289; (2) “The testing of statistical
hypotheses in relation to probabilities a priori”, Proc. Camb. Phil. Soc. vol. xxix (1933), p. 492.

‡ In our previous publications we denoted the sample point by $\Sigma$. This, however, was found to be
inconvenient because of the possible confusion with the sign of summation. The letter $E$ is suggested
by the possible term “event” point.

§ I.e. except perhaps for a set of points of measure zero.

As was shown in our previous publications and as is apparent from the
preceding section, any test of a statistical hypothesis, $H_0$, may be considered as
equivalent to a rule of rejecting $H_0$ whenever the sample point falls within a
specified region, $w$, called the critical region, and of accepting it in all other cases.
We have termed the probability of the first kind of error determined by $H_0$ the
size of the corresponding critical region $w$; thus if the size be $\alpha$ and the hypothesis
tested, $H_0$, determines the elementary probability law of the $x$’s, say

$$p(x_1, x_2, \ldots x_n \mid H_0) = p(x_1, x_2, \ldots x_n \mid \theta_0^{(1)}, \ldots \theta_0^{(c)}), \tag{2}$$

then

$$P\{E \in w \mid H_0\} = \int\!\cdots\!\int_{w} p(x_1, \ldots x_n \mid \theta_0^{(1)}, \ldots \theta_0^{(c)})\, dx_1 \ldots dx_n = \alpha, \tag{3}$$

where $\theta^{(1)}, \ldots \theta^{(c)}$ denote parameters involved in the elementary probability law
of the $x$’s and $\theta_0^{(1)}, \ldots \theta_0^{(c)}$ their values specified by $H_0$. Two tests based on critical
regions of the same size have been called equivalent. The probability of rejecting
$H_0$ when an alternative $H'$ is true has been termed the power of the test with
regard to $H'$. This may be written $P\{E \in w \mid H'\}$.

The most powerful test for $H_0$ with regard to $H'$ is the test whose power is
greater than that of any other equivalent test, and the critical region associated
with this test has been termed the best critical region for $H_0$ with regard to $H'$.
If $w$ be such a region, of size $\alpha$, then we have shown that it will be defined by the
inequalities

$$\left.\begin{aligned} p(x_1, \ldots x_n \mid H') &\geq k\, p(x_1, \ldots x_n \mid H_0) \text{ within } w \\ \text{and } p(x_1, \ldots x_n \mid H') &\leq k\, p(x_1, \ldots x_n \mid H_0) \text{ outside } w \end{aligned}\right\} \tag{4}$$

where $k$ is a constant determined so that (3) is satisfied.

If the region, say $w_0$, determined by the inequalities (4) is independent of the
alternative hypothesis $H'$, it has been termed the best critical region for $H_0$
common with regard to the whole class, $\Omega$, of admissible alternatives, and the test
has been called a uniformly most powerful test. Starting from the same
information, it is difficult to see how any better test than this could be found. As we
have shown, a number of important statistical tests fall into this category.

When, however, the region of size $\alpha$ determined by (4) is not the same for every
alternative hypothesis belonging to $\Omega$, further considerations must be introduced
in determining our choice of a critical region. It is no longer possible to find a
single region which minimizes the risk of accepting $H_0$ falsely whatever
alternative, $H'$, be true; nevertheless we believe that the choice of test should still be
made to depend, in some way, on its control of the second kind of errors. In the
present paper we shall discuss this problem of determining a “good critical region”
in situations where no uniformly most powerful test exists, confining attention
to the case where the hypothesis tested concerns only the value of a single
unknown parameter. The specification of the region to be described can, however,
be extended to cover other situations, and with these we shall hope to deal
shortly in a further publication.
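Why the region given by (4) may or may not depend on $H'$ can be seen in the model (1). The sketch below is an illustration under assumptions of our own choosing (normal observations, $\sigma$ known, hypothetical function names and data), not a procedure quoted from the paper. For that model the logarithm of the ratio in (4) equals $n(\xi'-\xi_0)\{\bar{x}-\tfrac{1}{2}(\xi'+\xi_0)\}/\sigma^2$, so the best region is an upper tail of $\bar{x}$ when $\xi' > \xi_0$ and a lower tail when $\xi' < \xi_0$; no single region serves both classes of alternatives.

```python
import math
from statistics import NormalDist, fmean


def log_likelihood_ratio(sample, xi_alt, xi0, sigma):
    """log p(sample | H') - log p(sample | H0), summed observation by observation."""
    alt, null = NormalDist(xi_alt, sigma), NormalDist(xi0, sigma)
    return sum(math.log(alt.pdf(x)) - math.log(null.pdf(x)) for x in sample)


def closed_form_ratio(sample, xi_alt, xi0, sigma):
    """The same quantity written through the sample mean only."""
    n, xbar = len(sample), fmean(sample)
    return n * (xi_alt - xi0) * (xbar - (xi_alt + xi0) / 2) / sigma ** 2


def best_region_rejects(sample, xi_alt, xi0, sigma, alpha):
    """Best critical region of size alpha for H0 against the single alternative xi_alt."""
    margin = NormalDist().inv_cdf(1 - alpha) * sigma / math.sqrt(len(sample))
    xbar = fmean(sample)
    if xi_alt > xi0:
        return xbar >= xi0 + margin   # upper tail of the sample mean
    return xbar <= xi0 - margin       # lower tail of the sample mean


sample = [11.2, 10.9, 11.5, 11.1]     # hypothetical observations
print(best_region_rejects(sample, 11.0, 10.0, 1.0, 0.05))
print(best_region_rejects(sample, 9.0, 10.0, 1.0, 0.05))
```

The same sample falls in the best region for an alternative above $\xi_0$ and outside the best region for an alternative below it.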


2. The choice of a critical region. 

As we have mentioned in the introductory section, in the development of the 
theory of applied statistics it is important to distinguish between those steps 
which can be established by mathematical theory and (on the assumption that 
this theory is correct) are therefore not open to question, and other steps in the 
making of which the statistician has a freedom of choice. 

In selecting a critical region in a case where no uniformly most powerful test 
exists, a situation of this kind must be faced, and we think it will be useful to 
introduce the subject with a simple illustration. 


Example III. Suppose that it is known that a sample of $n$ individuals, $x_1, \ldots x_n$,
has been drawn randomly from some normal population with mean $\xi$ and standard
deviation $\sigma$, so that the probability law of all the $x$’s is given by

$$p(x_1, \ldots x_n \mid \xi, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\xi)^{2}}. \tag{5}$$

Assume that the set of admissible hypotheses, $\Omega$, contains all hypotheses ascribing
to the two parameters in (5) any values $-\infty < \xi < +\infty$ and $0 < \sigma$. The hypothesis
tested, $H_0$, is that $\sigma = \sigma_0$, and if $\xi$ is not known $H_0$ is a composite hypothesis.
We have shown that in this case there is no uniformly most powerful test with
regard to the whole class of alternatives, $0 < \sigma$.* The test commonly employed in
this situation consists in a rule for rejecting $H_0$ either if $s^2 < s_1^2$ or $s^2 > s_2^2$, where
$s^2$ is the sample variance defined by

$$s^{2} = \sum_{i=1}^{n} (x_i - \bar{x})^{2} / n, \qquad \bar{x} = \sum_{i=1}^{n} x_i / n, \tag{6}$$

and $s_1^2$ and $s_2^2$ are determined so that†

$$\alpha_1 = P\{s^2 < s_1^2 \mid H_0\}, \qquad \alpha_2 = P\{s^2 > s_2^2 \mid H_0\} \tag{7}$$

and

$$\alpha_1 = \alpha_2 = \tfrac{1}{2}\alpha. \tag{8}$$

* Phil. Trans. Roy. Soc. A, vol. ccxxxi (1933), p. 321. One common best critical region exists
with regard to the class $0 < \sigma < \sigma_0$, and a second for the class $\sigma > \sigma_0$.
† The probability law for $s^2$ when $\sigma^2 = \sigma_0^2$ is

$$p(s^2 \mid H_0) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right)} \left(\frac{n}{2\sigma_0^{2}}\right)^{\frac{n-1}{2}} (s^2)^{\frac{n-3}{2}}\, e^{-\frac{n s^{2}}{2\sigma_0^{2}}},$$

and therefore

$$\alpha_1 = \int_{0}^{s_1^2} p(s^2 \mid H_0)\, ds^2, \qquad \alpha_2 = \int_{s_2^2}^{\infty} p(s^2 \mid H_0)\, ds^2.$$

These integrals may be readily obtained from Tables of the Incomplete Gamma Function, edited by
Karl Pearson, Biometrika Office, University College, London.

If this test is used the probability of rejecting $H_0$ when it is true is

$$\alpha_1 + \alpha_2 = \alpha, \tag{9}$$

but the same control over the first kind of error could be assured if (9) were true
but not (8). We may ask therefore what is the justification, beyond traditional
practice, of taking $\alpha_1 = \alpha_2$, or critical levels cutting off equal tail areas from the
probability distribution of $s^2$? Let us inquire what is the power of this test with
regard to an alternative $H'$ specifying $\sigma \neq \sigma_0$; for this purpose we must calculate

$$P\{E \in w \mid H'\} = 1 - P\{s_1^2 \leq s^2 \leq s_2^2 \mid H'\} \tag{10}$$

for the values of $s_1^2$ and $s_2^2$ defined by (7) and (8). This can be easily done for any
given values of $n$, $\alpha$, $\sigma_0$ and $\sigma$ using the Tables of the Incomplete Gamma Function.
The result of calculations made in the special case

$$n = 3, \qquad \alpha = 0.02, \qquad \sigma_0 = 1$$

is shown in Fig. 1, in the form of a curve for which the abscissa is $\sigma^2$ and the
ordinate is the power of the test or chance of rejecting $H_0$ ($\sigma = \sigma_0$) when an
alternative $H'$ ($\sigma = \sigma'$) is true.

Fig. 1. Test of $H_0$ ($\sigma = \sigma_0 = 1.0$) using equal tail areas; $\alpha_1 = \alpha_2 = 0.01$, $n = 3$. Abscissa: $\sigma^2$ in sampled population; ordinate: power of the test.
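The calculation (10) for this special case can be reproduced without tables. The sketch below is our own illustration and relies on a stated simplification: for $n = 3$ the quantity $n s^2/\sigma^2$ follows the $\chi^2$ law with 2 degrees of freedom, whose distribution function is $1 - e^{-x/2}$. The function and constant names are hypothetical.

```python
import math

ALPHA = 0.02      # size of the test
SIGMA0_SQ = 1.0   # variance specified by H0
N = 3             # sample size, so that N * s^2 / sigma^2 has 2 degrees of freedom


def chi2_2df_cdf(x):
    """Distribution function of chi-square with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0)


# Equal tail limits (7) and (8): each tail has probability ALPHA / 2 under H0.
s1_sq = -2.0 * math.log(1.0 - ALPHA / 2.0) * SIGMA0_SQ / N
s2_sq = -2.0 * math.log(ALPHA / 2.0) * SIGMA0_SQ / N


def power_equal_tails(sigma_sq):
    """Equation (10): 1 - P{s1^2 <= s^2 <= s2^2} when the true variance is sigma_sq."""
    scale = N / sigma_sq
    return 1.0 - (chi2_2df_cdf(scale * s2_sq) - chi2_2df_cdf(scale * s1_sq))


for sigma_sq in (0.25, 0.5, 0.75, 1.0, 1.5, 2.0):
    print(sigma_sq, round(power_equal_tails(sigma_sq), 5))
```

Under these assumptions the power equals $\alpha$ both at $\sigma^2 = 1$ and at $\sigma^2 = 0.5$, and lies below $\alpha$ between these two values.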

A curious anomaly is at once evident, for we see that for $0.5 < \sigma^2 < 1.0$ the
power of the test is less than 0.02. If $H_0$ be true and $\sigma^2 = \sigma_0^2 = 1.0$, the chance of
rejecting this hypothesis is, as planned, $P\{E \in w \mid H_0\} = \alpha = 0.02$. If, however, the
sample had in fact been drawn from a population in which $\sigma^2$ lay between 0.5
and 1.0, we should be less likely to reject the hypothesis tested (i.e. more likely
to accept it) than if it were true!

It can be shown that a similar situation exists for any $n$ and any $\alpha$ if the test
regarding $\sigma$ is based on equal tail regions (i.e. if $\alpha_1 = \alpha_2$), although as $n$ increases
the range of values of $\sigma$ for which the power of the test is less than $\alpha$ decreases
steadily.

The result illustrated by the curve of Fig. 1 is an objective result established
by mathematical theory as a necessary consequence of using a rule for rejection
based on the equality (8), $\alpha_1 = \alpha_2$. In interpreting this result there is freedom of
choice. The individual statistician may accept the consequences as
unobjectionable; he may reason, perhaps relying largely on intuition, that in his opinion the
use of critical levels, $s_1^2$ and $s_2^2$, based on equal tail regions is in some ways more
fundamental than the condition that the power of the test should never be less
than $\alpha$; or he may search for a modified form of test which will avoid this anomaly.

It is our own personal view that a test so designed that when certain alternative 
hypotheses are true, the hypothesis tested is more likely to be accepted than when 
it is itself true, cannot be regarded as satisfactory, since the consequences appear 
to offend certain common-sense principles. We shall therefore consider in the 
following sections how far it is possible to develop a systematic method of 
selecting a critical region which will not lead to the situation described. 


3. Unbiassed critical regions.

The problem which we shall now investigate is that in which all the admissible
simple hypotheses forming the set $\Omega$ specify the same function, $p(x_1, \ldots x_n \mid \theta)$,
representing the probability law of the $x$’s containing only one parameter, $\theta$, the
value of which is unknown. The hypothesis tested, $H_0$, ascribes to $\theta$ the value
$\theta = \theta_0$. Any other admissible hypothesis, $H'$, will differ from $H_0$ only in the value,
say $\theta' < \theta_0$ or $\theta' > \theta_0$, which it ascribes to $\theta$. The probability law of the $x$’s may
now be denoted by $p(E \mid \theta)$. Whatever be $\theta'$, $P\{E \in w \mid \theta'\}$ will denote the
probability, determined by the admissible hypothesis which assumes $\theta = \theta'$, that the
sample point $E$ is an element of any region $w$.

If now we attempt to avoid a situation similar to that presented in Example III,
our first objective must be to determine a critical region, $w$, such that $P\{E \in w \mid \theta\}$
has a minimum value of $\alpha$ when $\theta = \theta_0$, i.e. when $H_0$ is true. A region satisfying this
condition will be termed an unbiassed critical region, while other regions (or tests
based on such regions) will be termed biassed.

The necessary and sufficient conditions for $w$ to be an unbiassed critical region
are as follows:

$$P\{E \in w \mid \theta_0\} = \alpha, \tag{11}$$

$$P\{E \in w \mid \theta\} \geq P\{E \in w \mid \theta_0\} \tag{12}$$

for all admissible values of $\theta$ satisfying the condition

$$\theta_1 = \theta_0 - t_1 < \theta < \theta_0 + t_2 = \theta_2, \tag{13}$$

$t_1$ and $t_2$ being certain positive quantities. The interval $\theta_1$ to $\theta_2$ may include the
whole range of values of $\theta$ specified by the admissible hypotheses of $\Omega$, and in fact
this is the case for all unbiassed tests in practical use that we are aware of. Our
definition only implies, however, that the inequality (12) holds within a finite
interval containing $\theta_0$.

In general these conditions will not determine a unique region. For instance,
in the problem of testing the hypothesis that $\sigma = \sigma_0$ described in Example III,
we might use as criterion the sample variance, $s^2$, or the range* in the sample.
In both cases unbiassed critical regions of size $\alpha$ which are not identical can be
found.

In certain cases it will be possible to choose from among unbiassed regions of
size $\alpha$ one, say $w_0$, such that

$$P\{E \in w_0 \mid H'\} \geq P\{E \in w \mid H'\} \tag{14}$$

for every admissible hypothesis $H'$ of the set $\Omega$, alternative to $H_0$, and for every
region $w$ of the set of unbiassed regions of size $\alpha$. A region $w_0$ having these
properties may be termed the best unbiassed critical region, and the test associated
with it, the uniformly most powerful unbiassed test. If we accept the principle of
choosing an unbiassed test, and if among these a uniformly most powerful test
exists, then it is difficult to see that this test could be bettered. Of unbiassed tests
providing equal control of the first kind of error, this test reduces to a minimum
the risk of the second kind of error, whatever admissible hypothesis alternative
to $H_0$ be true. We must leave the examination of the general theory of such tests
for later discussion, at present approaching the problem in a form which is
mathematically simpler and which we believe does in fact, in the special examples
illustrated below, provide an unbiassed test which is also uniformly most powerful.

We shall now define as an unbiassed critical region of type A, a region $w$
satisfying the equation (11) and also the conditions

$$\left[\frac{dP\{E \in w \mid \theta\}}{d\theta}\right]_{\theta=\theta_0} = 0, \tag{15}$$

$$\left[\frac{d^{2}P\{E \in w \mid \theta\}}{d\theta^{2}}\right]_{\theta=\theta_0} \text{ is a maximum.} \tag{16}$$

On the assumption that the differential coefficients exist, the condition (16)
implies that in using $w$, the probability of rejecting $H_0$ when it is not true not only
increases as $|\theta - \theta_0|$ increases, but does so in the neighbourhood of $\theta = \theta_0$ at a
more rapid rate than if any other critical region satisfying (15) of the same size
were used. It may, of course, be argued that the behaviour of what we shall term
the power function, $P\{E \in w \mid \theta\}$, is of less importance in the neighbourhood of
$\theta = \theta_0$ than farther away on either side, because the consequences of accepting $H_0$
when an alternative hypothesis is true will in general be of less serious import
when $\theta$ differs only slightly from $\theta_0$ than when $\theta$ is considerably larger or smaller.
This criticism would, of course, not apply if the type A region (which is necessarily
unbiassed) was also the best unbiassed region.

* The range is the difference between the largest and smallest of the $n$ values of $x$.

The final verdict of the practical value of the region satisfying conditions (11),
(15) and (16) will depend therefore on the properties of the power function
throughout the whole range of admissible values of $\theta$. We shall, however, start
by considering the conditions under which a type A region can be obtained and
the procedure to be followed in determining it.
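Conditions (11) and (15) can be tried out on Example III. The following is a rough sketch under assumptions of our own: $n = 3$, $\sigma_0^2 = 1$, $\alpha = 0.02$, and attention restricted to regions of the form $s^2 < s_1^2$ or $s^2 > s_2^2$; it does not address the maximum property (16). On the scale $x = n s^2/\sigma_0^2$ with limits $a < b$, the power at variance $\sigma^2$ is $1 - e^{-a/(2\sigma^2)} + e^{-b/(2\sigma^2)}$, so that (15) reduces to $a e^{-a/2} = b e^{-b/2}$. The limits are found by bisection; all names are hypothetical.

```python
import math

ALPHA = 0.02  # size required by condition (11)


def upper_limit(a):
    """Upper limit b giving the region {x < a or x > b} size ALPHA under H0."""
    return -2.0 * math.log(ALPHA - 1.0 + math.exp(-a / 2.0))


def slope_at_null(a):
    """Twice the derivative of the power with respect to sigma^2 at sigma^2 = 1."""
    b = upper_limit(a)
    return b * math.exp(-b / 2.0) - a * math.exp(-a / 2.0)


# The slope is positive at a = 0 and negative just below the largest admissible a.
lo, hi = 0.0, -2.0 * math.log(1.0 - ALPHA) * (1.0 - 1e-9)
for _ in range(200):
    mid = (lo + hi) / 2.0
    if slope_at_null(mid) > 0.0:
        lo = mid
    else:
        hi = mid

a = (lo + hi) / 2.0
b = upper_limit(a)
alpha_1 = 1.0 - math.exp(-a / 2.0)   # lower tail probability under H0
alpha_2 = math.exp(-b / 2.0)         # upper tail probability under H0


def power_unbiased(sigma_sq):
    """Power of the region {x < a or x > b} when the true variance is sigma_sq."""
    return 1.0 - math.exp(-a / (2.0 * sigma_sq)) + math.exp(-b / (2.0 * sigma_sq))


print(alpha_1, alpha_2, a / 3.0, b / 3.0)  # tail areas and limits for s^2
```

In this sketch the two tail areas come out unequal, the lower tail taking the larger share of $\alpha$, and the power no longer falls below $\alpha$ at the values of $\sigma^2$ tried.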


4. The determination of an unbiassed critical region of type A. 
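Proposition I. If whatever be the region $w$ in the sample space, the derivatives

$$\left[\frac{dP\{E \in w \mid \theta\}}{d\theta}\right]_{\theta=\theta_0}, \qquad \left[\frac{d^{2}P\{E \in w \mid \theta\}}{d\theta^{2}}\right]_{\theta=\theta_0} \tag{17}$$

exist and are represented by the integrals

$$\left.\begin{aligned} &\int\!\cdots\!\int_{w} \left[\frac{\partial p(E \mid \theta)}{\partial\theta}\right]_{\theta=\theta_0} dx_1 \ldots dx_n = \int\!\cdots\!\int_{w} p_0'(E \mid \theta_0)\, dx_1 \ldots dx_n \text{ (say)}^{*} \\ \text{and } &\int\!\cdots\!\int_{w} \left[\frac{\partial^{2} p(E \mid \theta)}{\partial\theta^{2}}\right]_{\theta=\theta_0} dx_1 \ldots dx_n = \int\!\cdots\!\int_{w} p_0''(E \mid \theta_0)\, dx_1 \ldots dx_n \text{ (say)}^{*} \end{aligned}\right\} \tag{18}$$

respectively, then the region $w_0$ within which

$$p_0''(E \mid \theta_0) \geq k_1\, p_0'(E \mid \theta_0) + k_2\, p(E \mid \theta_0) \tag{19}$$

and outside which

$$p_0''(E \mid \theta_0) \leq k_1\, p_0'(E \mid \theta_0) + k_2\, p(E \mid \theta_0), \tag{20}$$

where $k_1$ and $k_2$ are so chosen that

$$\int\!\cdots\!\int_{w} p(E \mid \theta_0)\, dx_1 \ldots dx_n = \alpha, \tag{21}$$

$$\int\!\cdots\!\int_{w} p_0'(E \mid \theta_0)\, dx_1 \ldots dx_n = 0, \tag{22}$$

is the unbiassed region of type A defined above.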
The proof of this proposition is a simple consequence of the following Lemma.
Consider a set of integrable functions $F_0, F_1, \ldots F_m$, defined in the whole space
of $x_1, \ldots x_n$, and regions $w$ in this space satisfying the conditions

$$\int\!\cdots\!\int_{w} F_i\, dx_1 \ldots dx_n = c_i \quad (i = 1, 2 \ldots m), \tag{23}$$

where the $c$’s are some constants. Let $w_0$ be one of these regions, within which

$$F_0 \geq \sum_{i=1}^{m} (k_i F_i), \tag{24}$$

and outside of which

$$F_0 \leq \sum_{i=1}^{m} (k_i F_i), \tag{25}$$

where the $k$’s are constants, so chosen as to allow the region to satisfy the
conditions (23).

* This notation, $p_0'(E \mid \theta_0)$ and $p_0''(E \mid \theta_0)$, for the values of the derivatives of $p(E \mid \theta)$ at $\theta = \theta_0$
will be used below.

Lemma. Whatever be the region $w$ satisfying the conditions (23), it follows that

$$\int\!\cdots\!\int_{w} F_0\, dx_1 \ldots dx_n \leq \int\!\cdots\!\int_{w_0} F_0\, dx_1 \ldots dx_n, \tag{26}$$

where $w_0$ is the region defined above satisfying (23), (24) and (25).


Proof of Lemma. The regions $w$ and $w_0$ may have a common part; denote this
by $ww_0$. Further, let $w - ww_0$ and $w_0 - ww_0$ denote the part of $w$ which does not
belong to $w_0$ and the part of $w_0$ not belonging to $w$ respectively. As both $w$ and $w_0$
satisfy the conditions (23), it must follow that

$$\int\!\cdots\!\int_{w - ww_0} F_i\, dx_1 \ldots dx_n = \int\!\cdots\!\int_{w_0 - ww_0} F_i\, dx_1 \ldots dx_n. \tag{27}$$

Consider the difference, say

$$\Delta = \int\!\cdots\!\int_{w_0} F_0\, dx_1 \ldots dx_n - \int\!\cdots\!\int_{w} F_0\, dx_1 \ldots dx_n = \int\!\cdots\!\int_{w_0 - ww_0} F_0\, dx_1 \ldots dx_n - \int\!\cdots\!\int_{w - ww_0} F_0\, dx_1 \ldots dx_n. \tag{28}$$

Within the region $w_0 - ww_0$, $F_0$ satisfies the condition (24), and within $w - ww_0$,
the condition (25). It follows that

$$\Delta \geq \int\!\cdots\!\int_{w_0 - ww_0} \sum_{i=1}^{m} (k_i F_i)\, dx_1 \ldots dx_n - \int\!\cdots\!\int_{w - ww_0} \sum_{i=1}^{m} (k_i F_i)\, dx_1 \ldots dx_n. \tag{29}$$

But owing to (27), the right-hand side of the inequality (29) reduces to zero.
Consequently $\Delta \geq 0$ and the Lemma is proved.


Proof of Proposition I. If the conditions of Proposition I are satisfied, then the
problem of determining the unbiassed critical region of type A reduces to the
following: to find among all possible regions, for which equations (21) and (22)
hold good, the region $w_0$ for which the integral
$$\int\!\cdots\!\int_{w_0} p_0''(E \mid \theta_0)\, dx_1 \ldots dx_n \tag{30}$$
assumes a value which is not exceeded by the value of
$$\int\!\cdots\!\int_{w} p_0''(E \mid \theta_0)\, dx_1 \ldots dx_n \tag{31}$$
corresponding to any other region $w$ satisfying the same conditions (21) and (22).

Comparing this statement with that of the Lemma, it is found that the rôle of
$F_0$ is here played by the function $p_0''(E \mid \theta_0)$ and that of $F_1$ and $F_2$ by $p(E \mid \theta_0)$
and $p_0'(E \mid \theta_0)$. It follows that the region $w_0$ required, maximizing (31), is that
within which (19) is true and outside which (20) is true. Thus the proposition is
proved.




In the application of the general method to particular examples, it is of con- 
siderable importance that what may be termed the conditions of regularity of 
the function $p(E \mid \theta)$ under which the differentials of (17) may be represented by
the integrals of (18) should be clearly understood, since the solution of the 
problem of unbiassed critical regions contained in (19) and (20) depends upon 
these conditions being satisfied. The meaning of the restriction has therefore 
been described in some detail in an Appendix,* where two illustrative examples 
are given, in one of which the conditions are satisfied and in the other are not. 

Proposition II. If the probability law $p(E \mid \theta)$ satisfies the conditions of
regularity (a) and (b) of the Appendix (p. 31), and if there exists a sufficient
statistic $T$ of $\theta$, then the unbiassed critical region of type A is limited by surfaces
upon which $T$ has constant values.

Proof. Darmois† and Neyman‡ have shown that if $T$ is a sufficient statistic
of $\theta$, then in every point $E$ of the sample space, except perhaps in a set of measure
zero,
$$p(E \mid \theta) = p(T \mid \theta)\, f(x_1, \ldots x_n), \tag{32}$$
where $f(x_1, \ldots x_n)$ is a function not depending on the parameter $\theta$. Substituting
the right-hand side of this equality in the inequality (19) defining the unbiassed
critical region, we are able to divide both sides by $f(x_1, \ldots x_n)$ and obtain
$$p_0''(T \mid \theta_0) \ge k_1\, p_0'(T \mid \theta_0) + k_2\, p(T \mid \theta_0) \tag{33}$$
in every point of the region. This proves the theorem.


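5. Simplification of solution in terms of $\phi = \partial \log p / \partial \theta$.
The solution given in the preceding section may be simplified considerably in
the following way. If we write
$$\phi = \left.\frac{\partial \log p(E \mid \theta)}{\partial \theta}\right|_{\theta = \theta_0}, \qquad
\phi' = \left.\frac{\partial^2 \log p(E \mid \theta)}{\partial \theta^2}\right|_{\theta = \theta_0}, \tag{34}$$
we shall have
$$p_0'(E \mid \theta_0) = \phi\, p(E \mid \theta_0), \tag{35}$$
$$p_0''(E \mid \theta_0) = (\phi' + \phi^2)\, p(E \mid \theta_0). \tag{36}$$
Substituting these two expressions into the inequality (19) defining the unbiassed
critical region of type A, we obtain
$$\phi' + \phi^2 \ge k_1 \phi + k_2, \tag{37}$$
in any point where $p(E \mid \theta_0) \ne 0$. Further steps in the solution depend upon the
properties of $\phi'$ and $\phi$.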
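The step from (34) to (35), (36) and (37) is only stated above. A short verification, offered as a reader's check and using nothing beyond the definitions in (34), may be written as follows (all derivatives are taken at $\theta = \theta_0$):

```latex
\begin{align*}
\phi &= \frac{\partial \log p}{\partial \theta} = \frac{1}{p}\,\frac{\partial p}{\partial \theta}
  &&\Longrightarrow\quad p_0' = \phi\, p, \\
\frac{\partial}{\partial \theta}\bigl(\phi\, p\bigr) &= \phi'\, p + \phi\, p_0' = \phi'\, p + \phi^2 p
  &&\Longrightarrow\quad p_0'' = (\phi' + \phi^2)\, p, \\
p_0'' \ge k_1 p_0' + k_2 p
  &\iff (\phi' + \phi^2)\, p \ge (k_1 \phi + k_2)\, p
  &&\iff \phi' + \phi^2 \ge k_1 \phi + k_2 \quad (p > 0).
\end{align*}
```

The last line divides by $p(E \mid \theta_0)$, which is why (37) is asserted only at points where $p(E \mid \theta_0) \ne 0$.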

Case (a). Here $\phi'$ is a function of $\phi$, say $F(\phi)$, not involving the $x$'s explicitly.


* See p. 31. 

† G. Darmois, “Sur les lois de probabilité à estimation exhaustive”, Comptes Rendus, t. 200
(1935), p. 1265.

‡ J. Neyman, “Su un teorema concernente le cosiddette statistiche sufficienti”, Giornale dell’
Istituto Italiano degli Attuari, vol. vi (1935).




It may be shown that in this case a sufficient statistic of $\theta$ exists.* Here (37)
becomes an inequality in terms of $\phi$,
$$F(\phi) + \phi^2 \ge k_1 \phi + k_2, \tag{38}$$
and it is seen that the unbiassed critical region of type A, $w$, is limited by the
surfaces on which $\phi$ has certain constant values, say
$$\phi = c_i \quad (i = 1, 2, \ldots m), \tag{39}$$
where $m \ge 2$ is the number of different roots of the equation
$$F(\phi) + \phi^2 - k_1 \phi - k_2 = 0. \tag{40}$$
The region $w$ is thus defined by several inequalities of the type
$$c_i \le \phi \le c_{i+1}. \tag{41}$$
It follows that the probability of the sample point falling within $w$ is identical
with that of having $\phi$ within the limits (41), and is conveniently expressed using
the probability law $p(\phi)$ of $\phi$. Thus the condition (21) which $k_1$ and $k_2$, or in other
words the $c$'s of (39), must be chosen to satisfy may be written
$$\sum \int_{c_i}^{c_{i+1}} p(\phi)\, d\phi = \alpha, \tag{42}$$
where possibly $c_0 = -\infty$ and $c_{m+1} = +\infty$ and the summation extends over all
intervals (41) in which (38) is satisfied. The other condition (22) similarly
becomes
$$\sum \int_{c_i}^{c_{i+1}} \phi\, p(\phi)\, d\phi = 0. \tag{43}$$
In practice, the process of finding the unbiassed critical region of type A will
consist therefore in (i) writing down the inequality (38) and the equation (40),
(ii) obtaining the solutions (39) in terms of $k_1$ and $k_2$, (iii) substituting these into
(42) and (43), so determining $k_1$ and $k_2$ and therefore the limits for $\phi$, (41).


Case (b). Simple case of (a).
The process is particularly simple if
$$\phi' = F(\phi) = A + B\phi, \tag{44}$$
where $A$ and $B$ do not depend upon the $x$'s. Here (38) and (40) immediately
reduce to
$$\phi^2 \ge k_3 \phi + k_4 \tag{45}$$
and
$$\phi^2 - k_3 \phi - k_4 = 0, \tag{46}$$
where $k_3 = k_1 - B$ and $k_4 = k_2 - A$. It is seen that (45) is equivalent to the two inequalities
$$\phi \le c_1 \quad \text{and} \quad \phi \ge c_2, \tag{47}$$
where $c_1$ and $c_2$ are the two roots of (46). These may be found directly from (42)
and (43), which now reduce to
$$\int_{-\infty}^{c_1} p(\phi)\, d\phi + \int_{c_2}^{+\infty} p(\phi)\, d\phi = \alpha = 1 - \int_{c_1}^{c_2} p(\phi)\, d\phi \tag{48}$$
and
$$\int_{-\infty}^{c_1} \phi\, p(\phi)\, d\phi + \int_{c_2}^{+\infty} \phi\, p(\phi)\, d\phi = 0. \tag{49}$$
In a number of important cases met with in practice the relation (44) is true, and
for these it follows that the unbiassed critical region of type A is determined by
$$\int_{c_1}^{c_2} p(\phi)\, d\phi = 1 - \alpha, \tag{50}$$
$$\int_{c_1}^{c_2} \phi\, p(\phi)\, d\phi = 0, \tag{51}$$
and (47). The result (51) will be seen to follow from (49) if it is remembered that
the integral in (22) is necessarily zero if we substitute the whole sample space $W$
for the region $w$; consequently using (35), $\int_{-\infty}^{+\infty} \phi\, p(\phi)\, d\phi = 0$.

[Fig. 2.]

* G. Darmois, loc. cit.

Case (c). Suppose finally that $\phi'$ cannot be expressed as a function of $\phi$ which
does not involve the $x$'s explicitly. We shall then have to deal with the inequality
(37), or in different form
$$\phi' \ge k_2 + k_1 \phi - \phi^2. \tag{52}$$
If $\phi$ and $\phi'$ are considered as coordinates of a point $E'$ on a plane, as suggested
in Fig. 2, it is seen that the unbiassed critical region of type A is represented in
this plane by the area $w'$ above the parabola having equation
$$\phi'(\phi) = k_2 + k_1 \phi - \phi^2. \tag{53}$$
This parabola depends upon the two parameters $k_1$ and $k_2$, whose values must be
such as to satisfy the conditions
$$\int_{-\infty}^{+\infty} d\phi \int_{\phi'(\phi)}^{+\infty} p(\phi, \phi')\, d\phi' = \alpha, \qquad
\int_{-\infty}^{+\infty} \phi\, d\phi \int_{\phi'(\phi)}^{+\infty} p(\phi, \phi')\, d\phi' = 0. \tag{54}$$

6. Illustrative examples. 


Example IV. Suppose that the admissible hypotheses assume the following
probability law of the $x$'s:
$$p(E \mid \xi) = p(x_1, \ldots x_n \mid \xi) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-\frac{1}{2}\sum (x_i - \xi)^2} \quad (-\infty < x_i < +\infty;\ i = 1, 2, \ldots n). \tag{55}$$
In other words the variables $x$ are known to be independent and normally distributed
with unit standard deviation, but the value of the mean is uncertain.
The simple hypothesis, $H_0$, to be tested assumes that $\xi = \xi_0$, and it is desired to
determine the unbiassed critical region of type A and size $\alpha$ for testing $H_0$.

It is shown in detail in the Appendix that the function defined in (55) satisfies
the conditions of regularity justifying the use of the integrals of (18) in place of
the differentials of (17). The unbiassed critical region of type A is therefore defined
by the inequalities (19) and (20); more conveniently we may use the method of
Section 5, working in terms of $\phi$. We have
$$\log p(E \mid \xi) = -n \log \sqrt{2\pi} - \tfrac{1}{2} \sum (x_i - \xi)^2, \tag{56}$$
$$\phi = \sum (x_i - \xi_0), \qquad \phi' = -n, \tag{57}$$
and consequently the condition (44) is satisfied. Writing $\sum (x_i) = n\bar{x}$, we see that
$$\phi = n(\bar{x} - \xi_0). \tag{58}$$
Clearly we may substitute $\bar{x}$ for $\phi$ in the relations (47), (50) and (51). It follows
that the unbiassed critical region of type A is defined by the inequalities, say
$$\bar{x} \le \xi_0 - c_1' = \bar{x}_1, \qquad \bar{x} \ge \xi_0 + c_2' = \bar{x}_2, \tag{59}$$
where
$$\int_{\bar{x}_1}^{\bar{x}_2} p(\bar{x})\, d\bar{x} = 1 - \alpha, \tag{60}$$
$$\int_{\bar{x}_1}^{\bar{x}_2} (\bar{x} - \xi_0)\, p(\bar{x})\, d\bar{x} = 0. \tag{61}$$


Since, if $H_0$ be true, $\bar{x}$ is known to be distributed normally, and therefore
symmetrically, about $\xi_0$ with a standard deviation of $1/\sqrt{n}$, or
$$p(\bar{x}) = \sqrt{\frac{n}{2\pi}}\; e^{-\frac{1}{2} n (\bar{x} - \xi_0)^2}, \tag{62}$$
it follows from (61) that $c_1' = c_2' = c' = \lambda/\sqrt{n}$, say. Hence the unbiassed critical
region of type A is determined by the inequalities
$$\bar{x} \le \xi_0 - \lambda/\sqrt{n} \quad \text{and} \quad \bar{x} \ge \xi_0 + \lambda/\sqrt{n}, \tag{63}$$
where from (60) and (62) it is seen that $\lambda$ is obtained from the equation
$$\frac{1}{\sqrt{2\pi}} \int_{\lambda}^{+\infty} e^{-\frac{1}{2} t^2}\, dt = \tfrac{1}{2}\alpha \tag{64}$$
and is at once found, for any desired value of $\alpha$, by using the tables of the normal
probability integral.
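Where tables are not to hand, (64) can be solved with an inverse normal routine. The following is a minimal sketch using Python's standard library; the function name `type_a_limits` and the optional argument `sigma0` (set to 1, as in Example IV) are illustrative choices of this note, not notation from the text.

```python
from statistics import NormalDist


def type_a_limits(xi0, n, alpha, sigma0=1.0):
    """Return (lambda, lower, upper) for the limits (63) on the sample mean.

    H0 is rejected when the sample mean is <= lower or >= upper.
    """
    # lambda solves (64): the upper tail area of the unit normal is alpha / 2.
    lam = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = lam * sigma0 / n ** 0.5
    return lam, xi0 - half_width, xi0 + half_width


# Illustrative values (assumed for this sketch): xi0 = 0, n = 16, alpha = 0.05.
lam, lower, upper = type_a_limits(xi0=0.0, n=16, alpha=0.05)
print(round(lam, 4), round(lower, 4), round(upper, 4))
```

For $\alpha = 0.05$ the value of $\lambda$ agrees with the boundary 1.9600 quoted for region (c) in the footnote to Fig. 3.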

It is seen that in the present instance the test associated with the type A
unbiassed critical region is precisely the test in common use based on equal
“tail-areas” of the sampling distribution of $\bar{x}$. This is also the test which follows from
the use of the likelihood ratio.

It will also be noticed that since $\bar{x}$ is a sufficient statistic for $\xi$, the type A region
will be limited by surfaces upon which $\bar{x}$ = constant (Proposition II). A knowledge
of this boundary condition without the introduction of some further principle
will not, however, suffice to determine which out of the infinite number of regions
of size $\alpha$ that are limited by surfaces on which $\bar{x}$ is constant should be selected.

In Fig. 3 the power functions associated with six different critical regions of
size $\alpha = 0.05$, all bounded by surfaces $\bar{x}$ = constant, are represented.* On the
assumption that $\sigma = \sigma_0$ is known,† any one of these regions would give control of
the first kind of error if used to test the hypothesis $H_0$ that $\xi = \xi_0 = 0$. The regions
(a) and (b) are what we have termed the best critical regions that would be
appropriate to use in testing $H_0$ if the admissible alternative hypotheses were
confined to the set, $\Omega$, for which (a) $\xi > \xi_0$ or (b) $\xi < \xi_0$. In the case (a), for example,
where the value of the power function has only to be considered for $\xi > \xi_0$, no
critical region of size $\alpha$ can be found giving a power curve lying at any point above
the curve shown in the diagram associated with the best critical region. This is
true whether the alternative region is bounded by $\bar{x}$ = constant or not.‡ On the
other hand, if both classes of alternative hypotheses with $\xi > \xi_0$ and $\xi < \xi_0$ are
admissible, as we have supposed in formulating Example IV, neither of the best
critical regions is satisfactory; it is seen that, in one direction or the other, the
ordinates of the power curve tend to zero as $|\xi - \xi_0|$ increases.

Regions (c), (d), (e) and (f) all satisfy the conditions (11) and (15), and the first
three are unbiassed, i.e. satisfy (12) also. Clearly region (f) is of no practical value,
since the power of the test is less than $\alpha$ for every admissible alternative hypothesis.
Region (c) is the unbiassed critical region of type A satisfying (16). As will
be pointed out in Section 8 the region (c) is also of the form there defined as type $A_1$.
This is because besides satisfying (11) and (15) it satisfies condition (14) if the
alternative regions, $w$, specified in this inequality are limited to those which also
satisfy (11) and (15). In other words, not only does the power function associated
with region (c) satisfy the conditions of curvature required at $\xi = \xi_0$, but no other
unbiassed region of the same size satisfying (11) and (15) can be found having
greater power for any value of $\xi$ between $+\infty$ and $-\infty$. Thus in Fig. 3 its power
curve will always lie above curves such as those shown, associated with the
regions (d) and (e). The maximum power of region (e), for example, is with regard
to an alternative for which $\xi$ is about $1.9\,\sigma_0/\sqrt{n}$, but this is much less than the
power of the type A region; in the first case the power is about 0.14, in the second
case about 0.47. In other words, using region (c) if $\xi = \xi_0 + 1.9\,\sigma_0/\sqrt{n}$, we are much
more likely to detect the fact than by using region (e).

[Fig. 3. Relation of power of test to critical region. $H_0$ is that $\xi_0 = 0$; criterion is sample mean; size of region, $\alpha = 0.05$. Horizontal axis: scale of $\xi$ (mean of sampled population), unit $= \sigma_0/\sqrt{n}$. Vertical axis: power of test.]

* The boundaries for the regions are:
(a) +1.6449, (b) -1.6449, (c) -1.9600 and +1.9600,
(d) -2.3263, -0.0376 and +0.0376, +2.3263,
(e) -2.0537 and -1.6954, +1.6954 and +2.0537,
(f) -0.5388 and -0.4677, +0.4677 and +0.5388.

† In the example just discussed it was supposed that $\sigma_0 = 1$.

‡ For example, by using the sample median we could not hope to obtain a more powerful test.
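The comparisons read from Fig. 3 can be reproduced numerically. The sketch below assumes the boundaries in the footnote are expressed in units of $\sigma_0/\sqrt{n}$, and it assumes a particular reading of which side of each boundary is the rejection set, chosen so that every region has size 0.05; the dictionary `REGIONS` and the function `power` are names introduced here for illustration only.

```python
import math

INF = float("inf")


def unit_normal_cdf(z):
    """Probability integral of the unit normal distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Rejection sets as unions of intervals for z = (mean - xi0) * sqrt(n) / sigma0.
# The interval structure is an assumed reading of the footnote boundaries.
REGIONS = {
    "a": [(1.6449, INF)],
    "b": [(-INF, -1.6449)],
    "c": [(-INF, -1.9600), (1.9600, INF)],
    "d": [(-INF, -2.3263), (-0.0376, 0.0376), (2.3263, INF)],
    "e": [(-2.0537, -1.6954), (1.6954, 2.0537)],
    "f": [(-0.5388, -0.4677), (0.4677, 0.5388)],
}


def power(region, delta):
    """Chance of rejection when z is normal with mean delta and unit s.d."""
    return sum(
        unit_normal_cdf(hi - delta) - unit_normal_cdf(lo - delta)
        for lo, hi in REGIONS[region]
    )


for name in REGIONS:
    # Size of each region (delta = 0) and its power at delta = 1.9.
    print(name, round(power(name, 0.0), 4), round(power(name, 1.9), 4))
```

At a displacement of 1.9 units the computed powers of regions (c) and (e) agree with the figures of about 0.47 and about 0.14 quoted in the text.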


Example V. Suppose that the admissible hypotheses assume the following
form of probability law for the $x$'s:
$$p(E \mid \sigma) = p(x_1, \ldots x_n \mid \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\frac{1}{2\sigma^2}\sum (x_i^2)} \quad (-\infty < x_i < +\infty;\ i = 1, 2, \ldots n), \tag{65}$$
where $\sigma$ is any real positive number. The variables $x$ are here supposed normally
and independently distributed about a known mean, taken for convenience at
the origin, but the value of the standard deviation is uncertain. $H_0$ is the simple
hypothesis which assumes that $\sigma = \sigma_0$.

It can again be shown that the conditions of regularity are satisfied, and we
may proceed to determine the unbiassed region of type A following the method
of Section 5. It is found that
$$\phi = \frac{1}{\sigma_0}\left(\frac{\sum (x_i^2)}{\sigma_0^2} - n\right) = \frac{1}{\sigma_0}(v - n) \quad \text{(say)}, \tag{66}$$
and further that equation (44) is true for $\phi$. The solution may then be carried out
in terms of $v$, the sum of squares of the $x$'s measured in terms of $\sigma_0^2$, instead of $\phi$.
The unbiassed type A region will be defined by the inequalities
$$v = \sum (x_i^2)/\sigma_0^2 \le v_1 \quad \text{and} \quad v = \sum (x_i^2)/\sigma_0^2 \ge v_2, \tag{67}$$
where
$$\int_{v_1}^{v_2} p(v)\, dv = 1 - \alpha, \tag{68}$$
$$\int_{v_1}^{v_2} (v - n)\, p(v)\, dv = 0. \tag{69}$$
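The two statements made before (67), that $\phi$ has the form (66) and that (44) holds, can be checked by differentiating the logarithm of (65) directly. The lines below are a reader's verification; the particular values of $A$ and $B$ are derived here and are not quoted from the text.

```latex
\begin{align*}
\log p(E \mid \sigma) &= -n \log\bigl(\sigma\sqrt{2\pi}\bigr) - \frac{1}{2\sigma^2}\sum x_i^2, \\
\phi &= \left.\frac{\partial \log p}{\partial \sigma}\right|_{\sigma_0}
      = -\frac{n}{\sigma_0} + \frac{\sum x_i^2}{\sigma_0^3}
      = \frac{1}{\sigma_0}(v - n), \\
\phi' &= \left.\frac{\partial^2 \log p}{\partial \sigma^2}\right|_{\sigma_0}
      = \frac{n}{\sigma_0^2} - \frac{3\sum x_i^2}{\sigma_0^4}
      = \frac{1}{\sigma_0^2}(n - 3v)
      = -\frac{2n}{\sigma_0^2} - \frac{3}{\sigma_0}\,\phi .
\end{align*}
```

The last equality uses $v = n + \sigma_0 \phi$, so that (44) is satisfied with $A = -2n/\sigma_0^2$ and $B = -3/\sigma_0$, neither of which depends upon the $x$'s.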


Here $p(v)$ is of the well-known form, $v$ being distributed as $\chi^2$ with $n$ degrees of
freedom, thus
$$p(v) = \frac{1}{2^{\frac{1}{2}n}\, \Gamma(\tfrac{1}{2}n)}\; v^{\frac{1}{2}(n-2)}\, e^{-\frac{1}{2}v}. \tag{70}$$
Equation (69) may be written in the form
$$\int_{v_1}^{v_2} v^{\frac{1}{2}n} e^{-\frac{1}{2}v}\, dv - n \int_{v_1}^{v_2} v^{\frac{1}{2}n - 1} e^{-\frac{1}{2}v}\, dv = 0. \tag{71}$$
On applying to the first member of the left-hand side of (71) the formula for
integration by parts, we obtain two terms of which one cancels with the second
member. Consequently we are left with the result
$$\Bigl[-2 v^{\frac{1}{2}n} e^{-\frac{1}{2}v}\Bigr]_{v_1}^{v_2} = 0, \quad \text{or} \quad v_1^{\frac{1}{2}n} e^{-\frac{1}{2}v_1} = v_2^{\frac{1}{2}n} e^{-\frac{1}{2}v_2}. \tag{72}$$
The unbiassed type A region of size $\alpha$ is therefore defined by (67), where $v_1$ and $v_2$
must be determined to satisfy (68) and (72). The process of calculation is not
straightforward, but can be carried out by using some method of successive
approximation. Two methods which have been used in obtaining the results
presented below are described in Appendix II.

It will be noted that $v$ is a sufficient statistic for $\sigma^2$ and consequently, in accordance
with Proposition II, the boundaries of the type A region must be defined
by equations of the form $v$ = constant.
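As an illustration of how (68) and (72) may be solved together, the following sketch uses nested bisection. This is not either of the methods of Appendix II, which are not reproduced here; it is a substitute technique chosen for simplicity. The function names, and the series used for the $\chi^2$ probability integral, are assumptions of this note.

```python
import math


def chi2_cdf(v, n):
    """P(chi-square with n degrees of freedom <= v), by a power series."""
    if v <= 0.0:
        return 0.0
    a, x = 0.5 * n, 0.5 * v
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def log_boundary(v, n):
    """Logarithm of v^(n/2) * exp(-v/2), the quantity equated in (72)."""
    return 0.5 * n * math.log(v) - 0.5 * v


def matching_upper(v1, n):
    """The v2 > n at which log_boundary equals its value at v1 < n."""
    target = log_boundary(v1, n)
    lo, hi = float(n), 2.0 * n
    while log_boundary(hi, n) > target:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if log_boundary(mid, n) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def unbiased_limits(n, alpha):
    """Solve (68) and (72) for (v1, v2) by bisection on v1 in (0, n)."""
    lo, hi = 0.0, float(n)
    for _ in range(100):
        v1 = 0.5 * (lo + hi)
        v2 = matching_upper(v1, n)
        if chi2_cdf(v2, n) - chi2_cdf(v1, n) > 1.0 - alpha:
            lo = v1  # interval (v1, v2) too probable: move v1 upward
        else:
            hi = v1
    v1 = 0.5 * (lo + hi)
    return v1, matching_upper(v1, n)


for n, alpha in [(2, 0.10), (9, 0.10), (2, 0.02), (3, 0.02), (9, 0.02)]:
    v1, v2 = unbiased_limits(n, alpha)
    print(n, alpha, round(v1, 4), round(v2, 4))
```

The sketch relies on $v^{\frac{1}{2}n} e^{-\frac{1}{2}v}$ rising to a single maximum at $v = n$, so that each $v_1 < n$ has exactly one partner $v_2 > n$ satisfying (72), and on the probability between the limits decreasing as $v_1$ increases.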

For the purpose of illustration calculations have been carried out for three
sizes of sample, viz. $n = 2$, 3 and 9, and for two values of $\alpha$, viz. 0.10 and 0.02.*
Table I shows the values of $v_1$ and $v_2$ satisfying (68) and (72) and the corresponding
values of
$$\alpha_1 = \int_{0}^{v_1} p(v)\, dv, \qquad \alpha_2 = \int_{v_2}^{\infty} p(v)\, dv. \tag{73}$$
It also shows the values of $v_1$ and $v_2$ for which $\alpha_1 = \alpha_2 = \tfrac{1}{2}\alpha$, or in other words the
“equal tail area” limits. These are the ordinary lower and upper 5 per cent. and
1 per cent. levels of significance of $\chi^2$ with $n$ degrees of freedom.

TABLE I. Particulars of critical regions for testing $H_0$ ($\sigma^2 = \sigma_0^2$).

| $\alpha$ | Critical region  |                       | $n = 2$ | $n = 3$ | $n = 9$ |
|----------|------------------|-----------------------|---------|---------|---------|
| 0.10     | Type A unbiassed | $v_1$                 | 0.1676  | -       | 3.628   |
|          |                  | $v_2$                 | 7.8643  | -       | 18.087  |
|          |                  | $\alpha_1$            | 0.0804  | -       | 0.0658  |
|          |                  | $\alpha_2$            | 0.0196  | -       | 0.0342  |
|          |                  | $\alpha_1 + \alpha_2$ | 0.1000  | -       | 0.1000  |
|          | Equal tail areas | $v_1$                 | 0.1026  | -       | 3.325   |
|          |                  | $v_2$                 | 5.9915  | -       | 16.919  |
| 0.02     | Type A unbiassed | $v_1$                 | 0.0345  | 0.160   | 2.290   |
|          |                  | $v_2$                 | 11.6851 | 13.453  | 23.084  |
|          |                  | $\alpha_1$            | 0.0171  | 0.0162  | 0.0140  |
|          |                  | $\alpha_2$            | 0.0029  | 0.0038  | 0.0060  |
|          |                  | $\alpha_1 + \alpha_2$ | 0.0200  | 0.0200  | 0.0200  |
|          | Equal tail areas | $v_1$                 | 0.0201  | 0.115   | 2.088   |
|          |                  | $v_2$                 | 9.2103  | 11.341  | 21.666  |

It will be noticed that there are considerable differences between the boundaries
of the unbiassed (type A) and equal tail area critical regions, particularly in the
case of the smaller samples. Further, for the unbiassed regions, $\alpha_1$ is considerably
larger than $\alpha_2$.

* We are very much indebted to Mr M. R. El Shanawany and Mr C. Eisenhart for assisting in
these calculations. For $n = 3$, only the case $\alpha = 0.02$ was taken.

The significance of the position is shown more clearly in Figs. 4 (a) and (b),
where the power function, $P\{E \in w \mid \sigma\}$, has been plotted for a number of different
critical regions. It is supposed that for the hypothesis tested, $\sigma_0^2 = 1.0$. Turning
to Fig. 4 (b) the bias in the equal tail area region is particularly noticeable. In
this case ($n = 2$), when testing the hypothesis $H_0$ that $\sigma^2 = 1.0$ we are more likely
to accept $H_0$ if in fact $0.5 < \sigma^2 < 1.0$ than if $H_0$ be true. The bias when $n = 9$,
though it is seen to be present in Fig. 4 (a), is relatively of less importance, and the
larger the sample the more nearly the equal tail area and unbiassed critical region
will coincide.
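For $n = 2$ the bias just described can be checked in closed form, since $v/\sigma^2$ is then distributed as $\chi^2$ with 2 degrees of freedom and $P\{v \le t\} = 1 - e^{-t/2\sigma^2}$ when $\sigma_0 = 1$. The sketch below assumes $\alpha = 0.02$ for the comparison and takes the type A limits for that case from Table I; the function name `power_n2` is illustrative.

```python
import math


def power_n2(v1, v2, sigma2):
    """Chance of rejecting H0 (sigma0^2 = 1) for n = 2 when the variance is sigma2.

    The region rejects when v <= v1 or v >= v2.
    """
    return 1.0 - math.exp(-v1 / (2.0 * sigma2)) + math.exp(-v2 / (2.0 * sigma2))


alpha = 0.02  # assumed size for this illustration
# Equal tail area limits for n = 2, in closed form from the chi-square law.
equal_tails = (-2.0 * math.log(1.0 - alpha / 2.0), -2.0 * math.log(alpha / 2.0))
# Unbiassed type A limits for n = 2, alpha = 0.02, as given in Table I.
type_a = (0.0345, 11.6851)

for sigma2 in (0.5, 0.75, 1.0, 1.5):
    print(
        sigma2,
        round(power_n2(*equal_tails, sigma2), 4),
        round(power_n2(*type_a, sigma2), 4),
    )
```

With the equal tail area limits the chance of rejection falls below $\alpha$ for variances between 0.5 and 1.0 and returns to $\alpha$ at $\sigma^2 = 0.5$, which is the interval named in the text; with the type A limits it does not fall below $\alpha$, up to the rounding of the tabled values.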

While the former test, leading as it does to this anomaly, cannot in our opinion 
be regarded as altogether satisfactory, from the practical point of view there is no 
serious loss of efficiency except perhaps in the case of very small samples. If, for 
example, in routine testing for the control of quality of some industrial product 
$v$ (or $s^2$) based on tests on $n$ specimens were plotted in a control chart, it might be
worth while calculating the limits associated with the unbiassed region if $n < 10$,
but not for larger samples. We have, however, introduced these simple illustrative
examples because from other work already in progress we believe that the conception
of an unbiassed region is of much wider application. As in the case of the
mean already discussed in Example IV, it is possible to show that the unbiassed
test of type A is also of type $A_1$, in the sense defined on p. 27 below. In other
words, of critical regions of size $\alpha$ the first derivative of whose power function with
regard to $\sigma$ vanishes at $\sigma = \sigma_0$, none can be found with a curve lying anywhere
above that of the type A region.

As in Fig. 3, the power curves of the two best critical regions of size $\alpha$ have been
drawn in Fig. 4; the first is determined by the value of $v_1$ for which $\alpha_1 = \alpha$ in (73)
and the second by the value of $v_2$ making $\alpha_2 = \alpha$. If, for example, the set of admissible
alternative hypotheses is limited to those for which $\sigma^2 < \sigma_0^2 = 1.0$, so that we
are concerned only with that part of the diagram lying to the left of $\sigma^2 = 1.0$, then
no critical region of size $\alpha$ can be found leading to power curves falling above
those marked B.C.R. (1). Similarly, if the alternatives are limited to those for
which $\sigma^2 > \sigma_0^2 = 1.0$, the curve for B.C.R. (2) must fall above all others in the
right-hand half of the diagram.

It will perhaps be useful to make a few remarks here on the practical aspect of
these conclusions. In the first place it might be asked: In view of the greater
efficiency of the two best critical regions why not use B.C.R. (1) when $v/n < 1$ and
B.C.R. (2) when $v/n > 1$? Such a practice would in effect correspond to the use of
the equal tail area region with $\alpha_1 = \alpha_2 = \alpha$, i.e. the chance of rejecting $H_0$ if true
would now be $2\alpha$ and the test would not be equivalent to those already considered.
To bring it to equivalence we must take $\alpha_1 = \alpha_2 = \tfrac{1}{2}\alpha$ and come back to the biassed
equal tail area region of size $\alpha$ already discussed.


[Fig. 4. Comparison of critical regions for $v$ ($H_0$ is that $\sigma_0^2 = 1.0$). (a) $n = 9$; (b) $n = 2$, risk of error I = 0.02. Horizontal axes: $\sigma^2$ in sampled population. Vertical axes: chance of rejecting $H_0$ (power of test).]


Again, it might be argued that in practice when testing whether $\sigma^2 = \sigma_0^2$, it is
generally more important to avoid accepting $H_0$ when $\sigma^2 > \sigma_0^2$ than when $\sigma^2 < \sigma_0^2$;
consequently the equal tail area test is preferable to the type A test because:

(1) It is of greater power with regard to alternatives assuming $\sigma^2 > \sigma_0^2$.

(2) While of less power with regard to alternatives assuming $\sigma^2 < \sigma_0^2$, its
inferiority to the type A test is here of less importance.

Under the conditions suggested this argument is a reasonable one and em- 
phasises the fact that when a priori information regarding the possible alternatives 
exists, or when what we have termed elsewhere the quality of the errors of second 
kind must be taken into account,* the choice of a test can and should be based on 
a comparative study of the power functions of alternative tests. In very many 
problems, however, no attempt can be made at numerical assessment either of the 
a priori probabilities or of the consequences of different errors, and in such cir- 
cumstances we would ourselves prefer to use an unbiassed test and to select out 
of such tests, if it exists, that which is uniformly most powerful.

A final question that may be raised is this: Under what circumstances do we
employ the uniformly most powerful test based on the best critical region? A
common situation in practice is one in which it is wished to test whether the
unknown parameter has a value, say, of $\theta \ge \theta_0$, where $\theta_0$ is some specified value.
This is the manufacturer's problem when he wishes to know whether the variation
in some quality characteristic, $x$, has increased above a tolerance limit, say
$\theta_0 = \sigma_0$,† and if so to prevent the material being placed on the market. In technical
terms he wishes to use a sample of $n$ observations to test the composite hypothesis
$H_0$, that $\sigma^2 \ge \sigma_0^2$, to which the admissible alternatives assume that $\sigma^2 < \sigma_0^2$. In
using the best critical region of size $\alpha$ for the simple hypothesis $h_0$, that $\sigma^2 = \sigma_0^2$,
with regard to this set of admissible hypotheses, he knows that if he is in fact
sampling from a population in which $\sigma^2 > \sigma_0^2$, the chance of the first kind of error
will be less than $\alpha$ (its value if $h_0$ were true). Thus the risk of his sampling procedure
failing to detect a batch of material with $\sigma^2$ at or above the tolerance level is $\alpha$
or less, the value of $\alpha$ being at his choice.

In this way the uniformly most powerful test which was devised to test the
simple hypothesis assuming $\theta = \theta_0$, with regard to alternatives assuming $\theta < \theta_0$,
can with a slight difference in interpretation be used to test the composite
hypothesis that assumes $\theta \ge \theta_0$.


* See J. Neyman and E. S. Pearson, “The Testing of Statistical Hypotheses in relation to
probabilities a priori”, Proc. Camb. Phil. Soc. vol. xxix (1933), pp. 497, 502. The quality of the errors
depends upon the danger of the consequences of accepting $H_0$ when different alternatives are true.

† The manufacturer would not mind if the variation in his product were less than the tolerance
value.

Example VI. In Example IV we have compared several unbiassed regions the
boundaries of which were defined by equations $\bar{x}$ = constant; one of these was
the type A region and its power curve lay everywhere (except at the point $\xi = 0$)
above those of the others. It is of interest to compare the type A region for the
hypothesis regarding $\sigma^2$ discussed in Example V with an unbiassed region determined
by another measure of variation, namely the range in the sample. The
sampling distribution of the range, say $r$, has been determined by McKay for
normal variation in the case $n = 3$* in the following form, involving an integral
$$p(r) = \frac{6}{\sigma\sqrt{\pi}}\; e^{-r^2/4\sigma^2} \int_{0}^{r/\sigma\sqrt{6}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy. \tag{74}$$
The unbiassed region, $w$, for testing the hypothesis that $\sigma = \sigma_0$ which we shall
seek, is one defined by inequalities
$$r \le r_1 \quad \text{and} \quad r \ge r_2, \tag{75}$$
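where $r_1$ and $r_2$ must be chosen to satisfy equations corresponding to (11) and (15),
namely
$$P\{E \in w \mid \sigma_0\} = 1 - \left.\int_{r_1}^{r_2} p(r)\, dr\right|_{\sigma = \sigma_0} = \alpha, \tag{76}$$
$$\left.\frac{dP\{E \in w \mid \sigma\}}{d\sigma}\right|_{\sigma = \sigma_0} = -\left.\frac{\partial}{\partial \sigma}\int_{r_1}^{r_2} p(r)\, dr\right|_{\sigma = \sigma_0} = 0. \tag{77}$$
Using (74), equation (76) may be written
$$\frac{6}{\pi\sigma\sqrt{2}} \int_{r_1}^{r_2} e^{-r^2/4\sigma^2}\, dr \int_{0}^{r/\sigma\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy = 1 - \alpha, \tag{78}$$
and since it may be shown that the conditions of regularity are satisfied, we may
differentiate under the integral in (77) and obtain
$$-\frac{6}{\pi\sigma^2\sqrt{2}} \int_{r_1}^{r_2} e^{-r^2/4\sigma^2}\, dr \int_{0}^{r/\sigma\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy
+ \frac{6}{\pi\sigma\sqrt{2}} \int_{r_1}^{r_2} \left\{ \frac{r^2}{2\sigma^3}\, e^{-r^2/4\sigma^2} \int_{0}^{r/\sigma\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy - \frac{r}{\sigma^2\sqrt{6}}\, e^{-r^2/3\sigma^2} \right\} dr = 0, \tag{79}$$
in which $\sigma$ is to be put equal to $\sigma_0 = 1$, say. Making this substitution and using
(78) it is seen that
$$-(1 - \alpha) + \frac{3}{\pi\sqrt{2}} \int_{r_1}^{r_2} r^2 e^{-\frac{1}{4}r^2}\, dr \int_{0}^{r/\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy - \frac{\sqrt{6}}{\pi\sqrt{2}} \int_{r_1}^{r_2} r\, e^{-\frac{1}{3}r^2}\, dr = 0. \tag{80}$$
If we apply the formula for integration by parts to the first integral in (80),
i.e. to the middle term, we find after reduction that this term becomes
$$-\frac{6}{\pi\sqrt{2}} \left[ r\, e^{-\frac{1}{4}r^2} \int_{0}^{r/\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy \right]_{r_1}^{r_2}
+ \frac{6}{\pi\sqrt{2}} \int_{r_1}^{r_2} e^{-\frac{1}{4}r^2}\, dr \int_{0}^{r/\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy
+ \frac{\sqrt{6}}{\pi\sqrt{2}} \int_{r_1}^{r_2} r\, e^{-\frac{1}{3}r^2}\, dr. \tag{81}$$

* I.e. for the case where $p(E \mid \sigma)$ is given in equation (65). See A. T. McKay and E. S. Pearson,
“A note on the distribution of range in samples of n”, Biometrika, vol. xxv (1933), p. 417, equation (11).
The range is the difference between the highest and lowest value of $x$ in the sample.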


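The middle term in (81) is equal to $1 - \alpha$, by (78) with $\sigma = 1$; substituting into (80), it follows that we are
left with
$$\left[ r\, e^{-\frac{1}{4}r^2} \int_{0}^{r/\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy \right]_{r_1}^{r_2} = 0, \tag{82}$$
or
$$r_1 e^{-\frac{1}{4}r_1^2} \int_{0}^{r_1/\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy - r_2 e^{-\frac{1}{4}r_2^2} \int_{0}^{r_2/\sqrt{6}} e^{-\frac{1}{2}y^2}\, dy = 0. \tag{83}$$
The unbiassed region of size $\alpha$ based on range, for testing the hypothesis that
$\sigma = \sigma_0 = 1$, is therefore defined by the inequalities (75), where $r_1$ and $r_2$ must be
determined to satisfy (78) and (83).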

Computations were carried out in the single case of $\alpha = 0.02$, the following
method being adopted:*

(1) A table of the probability integral $I(r) = \int_{0}^{r} p(r)\, dr$ was obtained using
quadrature.

(2) Using a trial value of $r_1$, $\alpha_1 = I(r_1)$ was found from this table.

(3) $r_2$ was then found so that $1 - I(r_2) = \alpha - \alpha_1 = 0.02 - \alpha_1 = \alpha_2$, using backward
interpolation in the table.

(4) These values of $r_1$ and $r_2$ were inserted into (83), making the left-hand side
not zero but, say, equal to $J(r_1)$.

(5) Successive trial values of $r_1$ were taken and the process repeated until
several values of $J(r_1)$ were obtained lying above and below zero. From these,
by backward interpolation, it was possible to find the value of $r_1$ and the
corresponding $r_2$ which made $J(r_1) = 0$. These determined the required unbiassed
critical region of size $\alpha = 0.02$.

The values obtained were $r_1 = 0.250$, $r_2 = 4.647$, giving for the two tail areas
$\alpha_1 = 0.0171$, $\alpha_2 = 0.0029$, $\alpha_1 + \alpha_2 = 0.0200$. For the region based on equal tail areas,
$\alpha_1 = \alpha_2 = \tfrac{1}{2}\alpha = 0.0100$, it was found that $r_1 = 0.191$, $r_2 = 4.120$.† These latter limits
determine a critical region which is, of course, biassed.
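The same pair of conditions can be solved without tables. The sketch below is not the tabular procedure of steps (1) to (5): it replaces it with Simpson's rule for the integral of (74) and nested bisection, pairing each trial $r_1$ with the $r_2$ that satisfies (83) and then adjusting $r_1$ until (78) holds. The function names, the search brackets and the number of quadrature steps are assumptions of this note, and $\sigma_0 = 1$ throughout.

```python
import math


def range_density(r):
    """Density (74) of the range for n = 3 when sigma = 1.

    The inner integral of the unit normal density from 0 to r/sqrt(6)
    equals 0.5 * erf(r / sqrt(12)).
    """
    return 3.0 / math.sqrt(math.pi) * math.exp(-r * r / 4.0) * math.erf(r / math.sqrt(12.0))


def prob_between(a, b, steps=2000):
    """Simpson's rule for the integral of the density from a to b (steps even)."""
    h = (b - a) / steps
    total = range_density(a) + range_density(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * range_density(a + i * h)
    return total * h / 3.0


def boundary_term(r):
    """The bracketed quantity of (82), up to a constant factor."""
    return r * math.exp(-r * r / 4.0) * math.erf(r / math.sqrt(12.0))


def matching_upper(r1):
    """The r2 in (2, 20) at which boundary_term equals its value at r1."""
    target = boundary_term(r1)
    lo, hi = 2.0, 20.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if boundary_term(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def unbiased_range_limits(alpha):
    """Solve (78) and (83) for (r1, r2) by bisection on r1 in (0, 1.5)."""
    lo, hi = 1e-6, 1.5
    for _ in range(50):
        r1 = 0.5 * (lo + hi)
        r2 = matching_upper(r1)
        if prob_between(r1, r2) > 1.0 - alpha:
            lo = r1  # interval (r1, r2) too probable: move r1 upward
        else:
            hi = r1
    r1 = 0.5 * (lo + hi)
    return r1, matching_upper(r1)


r1, r2 = unbiased_range_limits(0.02)
print(round(r1, 3), round(r2, 3), round(prob_between(0.0, r1), 4))
```

The brackets rest on the bracketed quantity of (82) having a single maximum between $r = 1.5$ and $r = 2$, so that the lower limit is sought below 1.5 and the upper limit above 2. The limits found in this way may be compared with the values $r_1 = 0.250$ and $r_2 = 4.647$ quoted above.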

Fig. 5 shows a comparison of the power functions associated with

(1) The unbiassed type A region based on $v = \sum (x_i^2)$ discussed in Example V,
for which the limits $v_1$ and $v_2$ have been given in Table I.

(2) The unbiassed region based on the range, $r$, which has just been described.

The greater power of the type A region is at once evident. It is, however, of
importance to remember that in using $v$ we have supposed that the population
mean was known to lie at $\xi = 0$. If the value of $\xi$ is unknown then $v$ cannot be used.

* We must thank Miss C. M. Thompson and Mr M. R. El Shanawany for undertaking the
calculations.

† See Biometrika, loc. cit. p. 419.


We have been able to show that in this case,* to test the composite hypothesis
that $\sigma = \sigma_0$, the unbiassed critical region analogous to type A is defined by the
inequalities
$$v' = \sum (x_i - \bar{x})^2/\sigma_0^2 \le v_1' \quad \text{and} \quad v' = \sum (x_i - \bar{x})^2/\sigma_0^2 \ge v_2', \tag{84}$$
where $\bar{x}$ is the sample mean, $p(v')$ is of the form (70) with $n - 1$ substituted for $n$
and $v_1'$ and $v_2'$ are determined to satisfy the equations (68) and (72), $v'$ being
written everywhere for $v$ and $n - 1$ for $n$.


[Fig. 5. Comparison of critical regions for $v$ and range. $H_0$ is that $\sigma^2 = 1.0$ ($n = 3$, $\alpha = 0.02$). Horizontal axis: $\sigma^2$ in sampled population. Vertical axis: power of test.]

The power function for the test based on $v'$ for $n = 3$ is therefore identical with
that based on $v$ for $n = 2$, dealt with in Example V and drawn in Fig. 4 (b). If this
curve were added to Fig. 5 it would be found almost impossible to distinguish it
from that for the test based on range, although it does everywhere (except at
$\sigma = \sigma_0 = 1$) lie slightly above it. It follows that for samples of 3 the great increase
in power of the $v$-test over the $r$-test lies in the fact that the former is making use
of the known value of the population mean while the latter is not. If this mean is
not known, then the $v'$-test, i.e. that based on sample standard deviation, is only
slightly more powerful than the $r$-test based on range. As the sample size $n$
increases, the standard deviation test will of course become steadily more and
more powerful than the range test. It should be noted that for $n = 2$ the tests are
identical.

* The theory of unbiassed critical regions in the case of a composite hypothesis which are
termed “type B” has been dealt with by Neyman in a paper actually in print in the Bulletin de la
Société Mathématique de France (1936), entitled “Sur la vérification des hypothèses statistiques
composées”.


7. The invariance of an unbiassed region of type A with regard to transformations
of the unknown parameter.

The method that we have used in selecting a critical region from among unbiassed
regions has been based upon a consideration of the derivatives of the
power function, $P\{E \in w \mid \theta\}$, with regard to $\theta$ at $\theta = \theta_0$. It is of some importance to
show that the region selected will not be modified were a function of $\theta$ to play the
rôle of $\theta$ in the solution given. In Example V, for instance, we have taken $\theta = \sigma$;
it is natural to ask whether the solution would have differed had we taken $\theta = \sigma^2$.
Again, in certain situations it may happen that the importance of detecting a
departure of $\theta$ from the hypothetical value, $\theta_0$, may depend upon the sign of the
difference $\Delta = \theta - \theta_0$. Thus, for instance, in the case of Example IV, while the test
selected is of equal power with regard to alternative hypotheses assuming
$\theta_1 = \theta_0 - \Delta$ and $\theta_2 = \theta_0 + \Delta$, it might be considered desirable to modify the test so
that it was of equal power for, say, $\theta_1 = \theta_0 - \Delta$ and $\theta_2 = \theta_0 + 2\Delta$. Could this change
in the scale of importance be effected by substituting for $\theta$ an appropriate function
of $\theta$ in the solution?

If the type A test obtained with regard to @ is also what we have termed the 
uniformly most powerful unbiassed test (see p. 9), no useful modification in scale 
could be effected without making the test biassed. However, pending a general 
discussion of the relation between the type A and the uniformly most powerful 
unbiassed test, it is of importance to prove the following proposition. 

It will be seen that a change in the scale of importance of the form suggested is
equivalent to consideration of the set of admissible hypotheses, specifying values
of a new parameter, $\lambda$, connected with $\theta$ by means of the relation
$$\theta = \theta_0 + f(\lambda) = \psi(\lambda), \tag{85}$$
where $f(\lambda)$ is some specified increasing function of $\lambda$ such that $f(0) = 0$ and
consequently $\psi(0) = \theta_0$. The hypothesis tested is that $\lambda = \lambda_0 = 0$.


Proposition III. If the monotonically increasing function $\psi(\Delta)$ admits of
three consecutive derivatives, so that the probability law $p(E \mid \psi(\Delta))$ satisfies
the conditions of regularity (17) and (18), where the rôle of the parameter $\theta$ is
played by $\Delta$, and if $d\psi/d\Delta\,|_{\Delta=0}>0$, then the region $w$, which is the unbiassed
critical region of type A obtained with $\theta$ as the parameter specified by the admissible
hypotheses, will have the property of being an unbiassed critical region of
type A upon the assumption that the parameter specified by the admissible
hypotheses is $\Delta$.

In order to prove this proposition it will be sufficient to show that any region
$w$ having the properties (19), (20), (21) and (22) will satisfy similar conditions
when, instead of considering $\theta$ as the independent parameter, we consider $\Delta$ as
such, where

$\theta=\psi(\Delta)$, $\psi(0)=\theta_0$, $\left.\frac{d\psi}{d\Delta}\right|_{\Delta=0}=\psi'\neq 0$, $\left.\frac{d^2\psi}{d\Delta^2}\right|_{\Delta=0}=\psi''$. (86)


Upon these assumptions we shall have
$p_\Delta'(E \mid \psi(0))=p_\theta'(E \mid \theta_0)\,\psi'$, (87)
$p_\Delta''(E \mid \psi(0))=p_\theta''(E \mid \theta_0)\,\psi'^2+p_\theta'(E \mid \theta_0)\,\psi''$. (88)
Solving (87) and (88) with regard to $p_\theta'(E \mid \theta_0)$ and $p_\theta''(E \mid \theta_0)$, substituting into
(19), (20) and (22) and multiplying (22) by $\psi'^2>0$, we find that the following
inequality must hold within the region, $w$,

$p_\Delta''(E \mid \psi(0)) \ge k_1'\,p_\Delta'(E \mid \psi(0))+k_2'\,p(E \mid \psi(0))$, (89)

while outside $w$ the reverse must hold, where
$k_1'=(k_1\psi'^2+\psi'')/\psi'$, $k_2'=k_2\psi'^2$ (90)
and $\int\cdots\int_w p_\Delta'(E \mid \psi(0))\,dx_1\ldots dx_n=0$. (91)


Since the condition (21) is not affected by the change in the parameter specified 
by the admissible hypotheses, the proof of the proposition is completed. 

It follows that in so far as we are dealing with unbiassed critical regions of 
type A and testing a simple hypothesis regarding a single parameter, no change 
in the test will result from a transformation of the parameter. 
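
The substitution used in the proof may be written out step by step. The lines below are only a worked restatement, assuming that condition (22) has the form $p_\theta'' \ge k_1 p_\theta' + k_2 p$ at $\theta=\theta_0$ (its statement lies outside this section) and writing $\psi'$, $\psi''$ for the derivatives of the transformation at $\Delta=0$.

```latex
% Solve (87) and (88) for the derivatives with regard to \theta:
\begin{align*}
p_\theta'(E\mid\theta_0)  &= \frac{p_\Delta'(E\mid\psi(0))}{\psi'}, \\
p_\theta''(E\mid\theta_0) &= \frac{1}{\psi'^2}
   \Bigl\{ p_\Delta''(E\mid\psi(0)) - \frac{\psi''}{\psi'}\,p_\Delta'(E\mid\psi(0)) \Bigr\}.
\end{align*}
% Substitute into p_\theta'' \ge k_1 p_\theta' + k_2 p and multiply by \psi'^2 > 0:
\begin{align*}
p_\Delta'' - \frac{\psi''}{\psi'}\,p_\Delta' &\ge k_1\psi'\,p_\Delta' + k_2\psi'^2\,p, \\
p_\Delta'' &\ge \frac{k_1\psi'^2+\psi''}{\psi'}\,p_\Delta' + k_2\psi'^2\,p ,
\end{align*}
% which is (89) with the constants k_1', k_2' of (90).
```

Since $\psi'>0$, the condition that the integral of $p_\theta'$ over $w$ vanishes is equivalent to the condition (91) on $p_\Delta'$.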


8. The determination of an unbiassed critical region of type $A_1$.

The unbiassed critical region of type A has been determined so that the power
function satisfies certain conditions at $\theta=\theta_0$, but as pointed out in Section 3 the
practical value of a critical region will depend upon the properties of the power
function throughout the whole range of admissible values of $\theta$.

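We shall say that a region $w_0$ is an unbiassed critical region of type $A_1$ and
of size $\alpha$ if it has the following properties:

(a) $P\{E \in w_0 \mid \theta_0\}=\alpha$, (92)
(b) $\left.\frac{d}{d\theta}P\{E \in w_0 \mid \theta\}\right|_{\theta=\theta_0}=0$, (93)
and (c) $P\{E \in w_0 \mid \theta_1\} \ge P\{E \in w \mid \theta_1\}$ (94)

for every other region $w$ satisfying (92) and (93) and for every admissible
value $\theta_1 \neq \theta_0$.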


Since equations (92) and (93) are identical with (11) and (15), the specification
of the type $A_1$ region differs from that of the type A region in the substitution of
condition (94) in place of (16). It is felt intuitively and may be rigorously proved
that if regions of type $A_1$ and type A both exist they must be identical, but even
if the elementary probability law $p(E \mid \theta)$ is regular and admits of differentiation
under the integral representing $P\{E \in w \mid \theta\}$, type $A_1$ may not exist. If, however,
this region does exist, it possesses the important property that its power with
regard to any alternative hypothesis will exceed that of any other unbiassed
region of size $\alpha$ satisfying (93).




It is also probable that in very many cases the test of $H_0$ based on a type $A_1$
region will be the uniformly most powerful unbiassed test in the sense defined on
p. 9 above, i.e. will satisfy condition (14). It must be remembered, however,
that the condition (94) is not identical with (14); in the former the alternative
regions $w$ satisfy (92) and (93) (i.e. (11) and (15)), while in the latter they satisfy
(11) and (12). It may be possible (in fact one example of this kind is known) for
an unbiassed region, say $w_0'$, of size $\alpha$ to exist for which the derivative of
$P\{E \in w_0' \mid \theta\}$ does not exist at $\theta=\theta_0$, but which is nevertheless uniformly more
powerful than the type $A_1$ region, $w_0$. The situation is suggested in the following
diagram, where w,’ is described as of type Ag. 


[Fig. 6. Power curves for the type $A_1$ region $w_0$ and for the region $w_0'$, plotted against the scale of $\theta$.]


The relation between these regions clearly needs fuller discussion and illustration.
Here it will suffice to indicate the method by which the unbiassed critical
region of type $A_1$, if it exists, may be determined.

We shall assume that for any region $w$ the first derivative of the power function
given in (17) may be represented by the integral of (18). Then using the Lemma
of p. 10 it is not difficult to show that the region, say $w_0(\theta_1)$, which satisfies (92),
(93) and (94) for a fixed alternative value, $\theta=\theta_1$, is such that
within $w_0(\theta_1)$ $p(E \mid \theta_1) \ge k_1\,p_\theta'(E \mid \theta_0)+k_2\,p(E \mid \theta_0)$ (95)
and outside $w_0(\theta_1)$ $p(E \mid \theta_1) \le k_1\,p_\theta'(E \mid \theta_0)+k_2\,p(E \mid \theta_0)$, (96)
where $k_1$ and $k_2$ are two constants to be determined so as to satisfy (92) and (93).
If the region $w_0(\theta_1)$ so obtained is the same for every admissible alternative
$\theta_1 \neq \theta_0$, that is to say, is independent of $\theta_1$, then it will be an unbiassed critical
region of type $A_1$.




Example Va. We shall now show that the unbiassed critical region of type A
found in Example V when testing the hypothesis that $\sigma=\sigma_0$ has the property of
being also of type $A_1$. Using the probability law for the $x$'s given in (65), it is
found that the inequality (95) may be written

$\frac{1}{\sigma_1^{\,n}}\,e^{-\Sigma(x_i^2)/2\sigma_1^2} \ge \frac{1}{\sigma_0^{\,n}}\,e^{-\Sigma(x_i^2)/2\sigma_0^2}\left\{k_1\left(\frac{\Sigma(x_i^2)}{\sigma_0^3}-\frac{n}{\sigma_0}\right)+k_2\right\}$, (97)

where $k_1$ and $k_2$ are certain functions of $\sigma_1$ to be chosen so that $w_0(\sigma_1)$ satisfies
(92) and (93). Writing as before $\Sigma(x_i^2)=v\sigma_0^2$ and rearranging, the inequality (97)
may be written

$e^{cv}=e^{\frac12(\sigma_1^2-\sigma_0^2)v/\sigma_1^2} \ge a+bv$, (98)

where $a$ and $b$ are functions of $k_1$ and $k_2$ as well as of $\sigma_1$. It follows that the region
$w_0(\sigma_1)$ is limited by surfaces on which $v$ has certain constant values, and that
instead of finding $k_1$ and $k_2$ in (97) so as to satisfy (92) and (93), we may determine
values for $a$ and $b$ in (98) so as to satisfy the same conditions. We shall now show
that the inequality (98) may be made equivalent to two inequalities
$v \le v_1$ and $v \ge v_2$, (99)
where $v_1$ and $v_2$ have the values found previously in Example V, and therefore
satisfy (92) and (93), which in this special case assume the form (68) and (72).
If we substitute these two values of $v$ alternatively into both parts of (98) and
join them with the sign of equality, we obtain
$e^{cv_1}=a+bv_1$, $e^{cv_2}=a+bv_2$. (100)
If the values of $a$ and $b$ obtained by solving (100) are substituted into (98), it
may be shown that an inequality is obtained which is satisfied by values of $v$
which satisfy (99) and by these only. To prove this consider the function, say
$y(v)=e^{cv}-a-bv$. (101)
We shall have $y(v_1)=y(v_2)$, which shows that the function $y(v)$ must have at
least one extremum between $v_1$ and $v_2$. Taking the derivatives
$y'(v)=ce^{cv}-b$, $y''(v)=c^2e^{cv}>0$, (102)
we see that $y(v)$ has only one extremum and that it is a minimum. It follows that
$y(v)>0$ for $v<v_1$ and $v>v_2$, (103)
and $y(v)<0$ for $v_1<v<v_2$,
which proves that (98) is equivalent to (99) irrespective of the value of $c$. Further,
the critical region determined by (99) satisfies the conditions (92) and (93) and
being independent of $\sigma_1$ has the property of being an unbiassed critical region of
type $A_1$.
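
The convexity argument of (100) to (103) can be checked numerically. The following is a minimal sketch, not part of the original memoir: the function names and the trial values of $c$, $v_1$, $v_2$ are illustrative assumptions, and the code only verifies the sign pattern of $y(v)=e^{cv}-a-bv$ on a finite grid.

```python
import math

def chord_coefficients(c, v1, v2):
    """Solve e^{c v1} = a + b v1 and e^{c v2} = a + b v2 for (a, b), as in (100)."""
    b = (math.exp(c * v2) - math.exp(c * v1)) / (v2 - v1)
    a = math.exp(c * v1) - b * v1
    return a, b

def y(v, c, a, b):
    """The function y(v) = e^{cv} - a - bv of (101)."""
    return math.exp(c * v) - a - b * v

def sign_pattern_holds(c, v1, v2, grid=200):
    """Check y < 0 strictly between v1 and v2 and y > 0 outside, on a finite grid."""
    a, b = chord_coefficients(c, v1, v2)
    step = (v2 - v1) / grid
    inside = all(y(v1 + j * step, c, a, b) < 0 for j in range(1, grid))
    below = all(y(v1 * j / grid, c, a, b) > 0 for j in range(0, grid))
    above = all(y(v2 + j * step, c, a, b) > 0 for j in range(1, grid + 1))
    return inside and below and above
```

The pattern holds for positive and for negative $c$ alike (the case $c=0$, i.e. $\sigma_1=\sigma_0$, is excluded), which is the sense in which the region does not depend on the alternative.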
Example IV a. The proof that the unbiassed critical region of type A found in
Example IV, in testing the hypothesis concerning the value of the mean, is of
type $A_1$ may be carried out in a precisely similar manner.




9. Summary. 

After a brief review of the approach to the problem of testing statistical 
hypotheses developed in a number of earlier publications, we have considered 
methods which may be followed in choosing a critical region for testing a hypothesis
$H_0$ when no uniformly most powerful test exists. While the conceptions
introduced may be extended to cases where the hypothesis tested is composite 
and concerns more than one unknown parameter, we have confined attention in 
the present paper to the case of a simple hypothesis regarding a single unknown 
parameter. 

We have suggested that the practical value of any critical region, w, depends 
on the properties of its power function throughout the range of admissible values 
of $\theta$, and that a comparison of alternative regions of the same size may be readily
effected by a comparison of their power functions. We have defined an unbiassed 
critical region. 

In very many problems there appear to be strong intuitive grounds for con- 
fining the choice of a critical region to those which are unbiassed; if this principle 
is accepted, then it would appear best to choose from among such regions, if it 
exists, that giving the test which is uniformly more powerful than any other. 

The determination of the uniformly most powerful unbiassed test presents to 
a direct attack, however, certain mathematical difficulties. We have therefore 
approached the problem from another direction, defining the unbiassed critical
region of type A, and showing how this may be determined. The conditions of
regularity which must be satisfied by the probability function $p(E \mid \theta)$, upon
which the solution given depends, have been discussed in some detail in an 
Appendix and the solution has been illustrated on several examples involving 
statistical tests in common use. 

It has been shown that under certain conditions the critical region of type A 
will be bounded by surfaces upon which a sufficient statistic, $T$, of $\theta$ is constant.

Finally, the conception of a critical region of type $A_1$ has been introduced and
a method given for determining it, if it exists. The unbiassed critical regions of
type A obtained for the test of the simple hypothesis assuming $\theta=\theta_0$, concerning
(i) the mean ($\theta=\xi$) and (ii) the standard deviation ($\theta=\sigma$) of a normal population,
have been shown to be of type $A_1$; in other words, the tests associated with these
regions are uniformly more powerful than other unbiassed tests of the same size
for which $dP\{E \in w \mid \theta\}/d\theta$ vanishes at $\theta=\theta_0$.




APPENDICES 


APPENDIX I 
1. The conditions of regularity. 


It seems important to specify in some detail the meaning of the conditions
relating equations (17) to (18), under which the solution of the problem of unbiassed
critical regions contained in (19) and (20) is reached. These conditions
concern what may be called the regularity of the function $p(E \mid \theta)$, and among
other things are satisfied:

(a) If in any point, $E$, of the sample space and for $\theta_0-\Delta_0<\theta<\theta_0+\Delta_0$, say, the
derivatives $p_\theta'(E \mid \theta)$, $p_\theta''(E \mid \theta)$ and $p_\theta'''(E \mid \theta)$ exist and are integrable over any
region contained in the same space.


(b) If there exists a non-negative function, $f(x_1,\ldots,x_n)$, of the $x$'s independent
of $\theta$ such that the integral

$I(r)=\int\cdots\int f(x_1,\ldots,x_n)\,dx_1\ldots dx_n$, (104)

extending over the region $\sum_{i=1}^{n}(x_i-a)^2 \ge r^2$, $a$ and $r$ being some fixed numbers,
exists; and further if it is possible to find such a number $M>0$ that

$|p_\theta'''(E \mid \theta)|<M$ (105)
in any point $E$ for which $\sum_{i=1}^{n}(x_i-a)^2<r^2$,
and $|p_\theta'''(E \mid \theta)|<f(x_1,\ldots,x_n)$ (106)
for any $E$ such that $\sum_{i=1}^{n}(x_i-a)^2 \ge r^2$,

whatever be the value of $\theta$ within the limits $\theta_0-\Delta_0<\theta<\theta_0+\Delta_0$.

It is easily seen that if the conditions (a) and (b) are satisfied, then the integrals
(18) do represent the derivatives (17). In fact, under conditions (a) and (b) we
may write

$p(E \mid \theta_0+\Delta\theta)=p(E \mid \theta_0)+\Delta\theta\,p_\theta'(E \mid \theta_0)+\frac{(\Delta\theta)^2}{2!}\,p_\theta''(E \mid \theta_0)+\frac{(\Delta\theta)^3}{3!}\,p_\theta'''(E \mid \theta_0+\vartheta\Delta\theta)$ (107)

for any sample point $E$ and for $|\Delta\theta|<\Delta_0$, where $0<\vartheta<1$. Take now any region
$w$ in the sample space and denote by $w'$ and $w''$ the parts of this in which

$\sum_{i=1}^{n}(x_i-a)^2<r^2$ and $\sum_{i=1}^{n}(x_i-a)^2 \ge r^2$

respectively. We shall have

$\frac{1}{\Delta\theta}\left[\int\cdots\int_w p(E \mid \theta_0+\Delta\theta)\,dx_1\ldots dx_n-\int\cdots\int_w p(E \mid \theta_0)\,dx_1\ldots dx_n\right]$
$=\int\cdots\int_w p_\theta'(E \mid \theta_0)\,dx_1\ldots dx_n+\frac{\Delta\theta}{2}\int\cdots\int_w p_\theta''(E \mid \theta_0)\,dx_1\ldots dx_n$
$+\frac{(\Delta\theta)^2}{3!}\left\{\int\cdots\int_{w'}+\int\cdots\int_{w''}\right\}p_\theta'''(E \mid \theta_0+\vartheta\Delta\theta)\,dx_1\ldots dx_n$. (108)

Now the function $p_\theta''(E \mid \theta_0)$ is according to our assumption integrable, and
therefore the second term on the right-hand side of (108) tends to zero as $\Delta\theta \to 0$.
Similarly the third term will tend to zero with $\Delta\theta$, since the absolute value of the
expression in brackets satisfies

$\left|\left\{\int\cdots\int_{w'}+\int\cdots\int_{w''}\right\}p_\theta'''(E \mid \theta_0+\vartheta\Delta\theta)\,dx_1\ldots dx_n\right|<MV(r)+I(r)$ (109)

and is therefore bounded, $V(r)$ being the content of the hypersphere of radius $r$. It follows
that as $\Delta\theta \to 0$, the right-hand side of (108) tends to

$\int\cdots\int_w p_\theta'(E \mid \theta_0)\,dx_1\ldots dx_n$,

which, owing to the form of the left-hand side, represents the derivative

$\left.\frac{dP\{E \in w \mid \theta\}}{d\theta}\right|_{\theta=\theta_0}$.

Similarly, we could prove that the second integral in (18) represents the second
of the derivatives in (17).

These are, of course, sufficient conditions for the validity of our solution. It
will, however, be seen below that the existence of the derivative $p_\theta'(E \mid \theta)$ in
every point of the sample space is a necessary condition; if it is not satisfied then,
at least in certain cases, the form of the solution we have reached is not valid.


2. Illustration of investigation into the conditions of regularity. 

Example IV a. This example has already been discussed on pp. 15-18 above.
It is, however, of interest to show in detail that the conditions (a) and (b) of p. 31
above, which make the solution a valid one, are satisfied. In the first place it will
be seen that whatever be the sample point $E$, the first three derivatives with
regard to $\xi$ of the function $p(E \mid \xi)$ defined in equation (55) exist:

$p_\xi'(E \mid \xi)=p(E \mid \xi)\,\Sigma(x_i-\xi)$,
$p_\xi''(E \mid \xi)=p(E \mid \xi)\,\{[\Sigma(x_i-\xi)]^2-n\}$, (110)
$p_\xi'''(E \mid \xi)=p(E \mid \xi)\,\{[\Sigma(x_i-\xi)]^3-3n\,\Sigma(x_i-\xi)\}$.

These expressions are integrable. We must now show that the third derivative
satisfies condition (b). For this purpose, (i) fix a number $\Delta_0>0$, (ii) fix $a=\xi_0$, and
write

$\Sigma(x_i-\xi_0)^2=r^2$. (111)




We shall now need the maximum value of $|\Sigma(x_i-\xi_0)|$ subject to the condition
(111). This can be obtained by applying the method of undetermined multipliers
of Lagrange, and by maximizing $\{\Sigma(x_i-\xi_0)\}^2$. The method consists in differentiating

$F=\{\Sigma(x_i-\xi_0)\}^2+\alpha\,\Sigma(x_i-\xi_0)^2$, (112)

where $\alpha$ is a parameter to be determined later so as to satisfy the condition (111),
and in equating the derivatives to zero. We have

$\frac{\partial F}{\partial x_i}=2\,\Sigma(x_i-\xi_0)+2\alpha\,(x_i-\xi_0)=0$, (113)
whence $x_i=\xi_0-n(\bar{x}-\xi_0)/\alpha$, (114)
and it is seen that the maximum is attained when all the $x$'s have the same value.
Since this must also be the value of $\bar{x}$ it follows that $\alpha=-n$. It is seen from (111)
that $\Sigma(x_i-\xi_0)^2=n(\bar{x}-\xi_0)^2=r^2$, (115)
whence $\bar{x}=\xi_0 \pm n^{-\frac12}r$, (116)
and finally $\Sigma(x_i-\xi_0)=n(\bar{x}-\xi_0)=\pm\sqrt{n}\,r$. (117)

It follows that for any system of $x$'s satisfying (111)
$-\sqrt{n}\,r \le \Sigma(x_i-\xi_0) \le +\sqrt{n}\,r$. (118)
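
The bound (118) can be illustrated numerically. The sketch below is an illustration only and not part of the original argument: it draws random points on the sphere $\Sigma(x_i-\xi_0)^2=r^2$ (the sampling scheme, function names and trial values are assumptions) and compares the largest observed $|\Sigma(x_i-\xi_0)|$ with $\sqrt{n}\,r$, which is attained when all the $x$'s are equal.

```python
import math
import random

def random_point_on_sphere(n, centre, r, rng):
    """Draw a point with sum((x_i - centre)^2) = r^2 by normalising Gaussian draws."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in z))
    return [centre + r * t / norm for t in z]

def deviation_sum(x, centre):
    """The quantity sum(x_i - centre) whose extreme values are sought."""
    return sum(t - centre for t in x)

def max_abs_sum(n, centre, r, trials=2000, seed=0):
    """Largest |sum(x_i - centre)| found over random points on the sphere."""
    rng = random.Random(seed)
    return max(
        abs(deviation_sum(random_point_on_sphere(n, centre, r, rng), centre))
        for _ in range(trials)
    )
```

A random search of this kind can only approach the bound from below; the exact maximum is the one given by the Lagrange argument.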

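Now whatever be $\xi$ within the limits $\xi_0-\Delta_0<\xi<\xi_0+\Delta_0$,

$|\Sigma(x_i-\xi)| \le |\Sigma(x_i-\xi_0)|+n\,|\xi_0-\xi| \le \sqrt{n}\,r+n\Delta_0$, (119)
and therefore $|\Sigma(x_i-\xi)|^3 \le (\sqrt{n}\,r+n\Delta_0)^3$. (120)

Further, $\Sigma(x_i-\xi)^2=\Sigma(x_i-\xi_0)^2+2(\xi_0-\xi)\,\Sigma(x_i-\xi_0)+n(\xi_0-\xi)^2$
$\ge r^2-2\,|\xi_0-\xi|\,\sqrt{n}\,r+n(\xi_0-\xi)^2$
$\ge r^2-2\Delta_0\sqrt{n}\,r=r(r-2\Delta_0\sqrt{n})$, (121)
and if $r>r_0=4\Delta_0\sqrt{n}$, then $r-2\Delta_0\sqrt{n}>\frac12 r$ and
$\Sigma(x_i-\xi)^2>\frac12 r^2$. (122)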


We may now use (119), (120) and (121) to determine a function $f(E)$ independent
of $\xi$ which exceeds $|p_\xi'''(E \mid \xi)|$ in any point of the sample space in which

$r^2=\Sigma(x_i-\xi_0)^2>r_0^2=16\Delta_0^2 n$ (123)
and is integrable. We notice first that owing to (122)

$p(E \mid \xi)<\left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac14 r^2}$, (124)

and next that
$|[\Sigma(x_i-\xi)]^3-3n\,\Sigma(x_i-\xi)| \le |\Sigma(x_i-\xi)|^3+3n\,|\Sigma(x_i-\xi)|$
$\le (\sqrt{n}\,r+n\Delta_0)^3+3n(\sqrt{n}\,r+n\Delta_0)$. (125)
Thus $f(E)$ may be defined as

$f(E)=\left(\frac{1}{\sqrt{2\pi}}\right)^n e^{-\frac14 r^2}\{(\sqrt{n}\,r+n\Delta_0)^3+3n(\sqrt{n}\,r+n\Delta_0)\}$. (126)




It is easily seen that for $\Sigma(x_i-\xi_0)^2 \ge r_0^2$, $f(E)$ cannot be less than $|p_\xi'''(E \mid \xi)|$ if
$\xi_0-\Delta_0<\xi<\xi_0+\Delta_0$, and that this function is integrable over the whole sample
space. Its integral

$\int\cdots\int f(E)\,dx_1\ldots dx_n$ (127)

is found by introducing a new series of variables, including $r$, of the form used in
deducing the probability law of the sum of squares of $n$ independent normal
variables, and is expressible in terms of Gamma functions. On the other hand
it can be seen that within the region in which $\Sigma(x_i-\xi_0)^2<r_0^2$ the derivative
$p_\xi'''(E \mid \xi)$ is bounded whatever the value of $\xi$ within the assigned limits, namely

$|p_\xi'''(E \mid \xi)|<\left(\frac{1}{\sqrt{2\pi}}\right)^n\{(\sqrt{n}\,r_0+n\Delta_0)^3+3n(\sqrt{n}\,r_0+n\Delta_0)\}=M$ (say), (128)

and it follows that the condition (b) on p. 31 is satisfied.


Example VII. We shall now describe an example in which the solution of the
problem of unbiassed critical regions as given by the inequalities (19) and (20)
is not valid. Suppose that the admissible hypotheses assume that the probability
law of the $x$'s is of the form

$p(E \mid \theta)=p(x_1,\ldots,x_n \mid \theta)=e^{-\Sigma(x_i-\theta)}$ (129)
in any point, $E$, of the region, say $W_1(\theta)$, in which $x_i \ge \theta$ $(i=1, 2, \ldots n)$, and that
$p(E \mid \theta)=0$ (130)

in the remaining part of the sample space, say $W_2(\theta)$, where $\theta$ is any real number.
The hypothesis tested, $H_0$, assumes that $\theta=\theta_0$. In order to test whether the
conditions for the validity of the solution are satisfied, take first a point $E_1$ in
which $x_i>\theta_0$ $(i=1, 2, \ldots n)$. It is easily seen that at this point the first three
derivatives with regard to $\theta$ exist; their values are

$p_\theta'(E_1 \mid \theta_0)=n\,p(E_1 \mid \theta_0)$, $p_\theta''(E_1 \mid \theta_0)=n^2\,p(E_1 \mid \theta_0)$,
$p_\theta'''(E_1 \mid \theta_0)=n^3\,p(E_1 \mid \theta_0)$. (131)
Similarly, the derivatives of $p(E \mid \theta)$ with regard to $\theta$ exist in any point $E_2$ of
$W_2(\theta_0)$ and are all equal to zero. Consider now a point $E'$ lying on the frontier of
$W_1(\theta_0)$; it will be seen that $p_\theta'(E \mid \theta)$ does not exist at this point. In fact, if
$\Delta\theta>0$, then the point $E'$ will be within the region $W_2(\theta_0+\Delta\theta)$ and also within
the region $W_1(\theta_0-\Delta\theta)$, and therefore we shall have
$p(E' \mid \theta_0+\Delta\theta)=0$, $p(E' \mid \theta_0-\Delta\theta)=e^{-\Sigma(x_i-\theta_0+\Delta\theta)}$ (132)
and $\lim_{\Delta\theta \to 0}p(E' \mid \theta_0+\Delta\theta)=0$, $\lim_{\Delta\theta \to 0}p(E' \mid \theta_0-\Delta\theta)=e^{-\Sigma(x_i-\theta_0)}$. (133)

It follows that $p(E' \mid \theta)$ considered as a function of $\theta$ is not even continuous at
the point $\theta=\theta_0$. Thus the derivative $p_\theta'(E' \mid \theta)$ does not exist and the validity
conditions for the solution given above are not satisfied. As a result, for certain
regions $w$ in the sample space, the integrals (18) do not represent the derivatives



(17). It is easily seen that such will be regions including points belonging both to
$W_1(\theta)$ and $W_2(\theta)$.

To make the position quite clear, we may consider the case when $n=2$. Take
for the region $w$ that in which

$x_1+x_2<a=\text{constant}$. (134)

Assume that $2\theta<a$ and calculate the probability of the sample point falling within
$w$. This is given by

$P\{E \in w \mid \theta\}=\int_\theta^{a-\theta}\left\{\int_\theta^{a-x_1}e^{-(x_1+x_2-2\theta)}\,dx_2\right\}dx_1$
$=1-e^{-(a-2\theta)}-(a-2\theta)\,e^{-(a-2\theta)}$. (135)
The derivative of this probability with regard to $\theta$ exists for $2\theta<a$, namely

$\frac{d}{d\theta}P\{E \in w \mid \theta\}=-2e^{-(a-2\theta)}+2\{1-(a-2\theta)\}\,e^{-(a-2\theta)}$, (136)

but it is not represented by the integral of $p_\theta'(E \mid \theta)$ taken over the region $w$.
In fact, using (131) and (135) we find

$\int\!\!\int_w p_\theta'(E \mid \theta)\,dx_1\,dx_2=2\int\!\!\int_w p(E \mid \theta)\,dx_1\,dx_2$
$=2\{1-e^{-(a-2\theta)}-(a-2\theta)\,e^{-(a-2\theta)}\}$, (137)

which is different from (136). As our solution for the unbiassed region is essentially
based on the assumption that the derivative in (136) is identical with the integral
in (137), it follows that it does not apply to the present example.*
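
The discrepancy between (136) and (137) is easily seen numerically. The following sketch is an illustration under stated assumptions: it takes the closed form $P=1-e^{-s}-se^{-s}$ with $s=a-2\theta$ for the case $n=2$, differentiates it by a central difference, and compares the result with $2P$, the value of the integral of $p_\theta'$ over $w$. The function names and the step $h$ are hypothetical choices.

```python
import math

def power(theta, a):
    """P{E in w | theta} for n = 2 and w: x1 + x2 < a, with s = a - 2*theta > 0."""
    s = a - 2.0 * theta
    return 1.0 - math.exp(-s) - s * math.exp(-s)

def derivative_of_power(theta, a, h=1e-6):
    """Central-difference approximation to dP/dtheta."""
    return (power(theta + h, a) - power(theta - h, a)) / (2.0 * h)

def integral_of_derivative(theta, a):
    """Integral over w of p'_theta = n*p with n = 2, i.e. twice the probability."""
    return 2.0 * power(theta, a)
```

For any $\theta$ with $2\theta<a$ the derivative of the probability is negative (moving $\theta$ upwards shifts the distribution out of $w$), while the integral of the derivative is positive, so the two cannot coincide.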


APPENDIX II 


Calculation of the limits of the type A unbiassed critical region for $v$.
As shown in Example V, this region, of size $\alpha$, is defined by the inequalities

$v \le v_1$ and $v \ge v_2$, (138)
where $\int_0^{v_1}p(v)\,dv=\alpha_1$, (139)
$\int_{v_2}^{\infty}p(v)\,dv=\alpha_2$, (140)
$\alpha_1+\alpha_2=\alpha$, (141)
$v_1^{\frac12 n}e^{-\frac12 v_1}-v_2^{\frac12 n}e^{-\frac12 v_2}=\Delta=0$, (142)


* It may be noticed that the fact that an integral of a derivative does not necessarily represent
the derivative of the integral, which seems often to have been overlooked by statisticians, is the
cause of many misunderstandings surrounding the conception of the so-called “amount of
information” as introduced by R. A. Fisher.






and the integrals in (139) and (140) may be obtained by entering the Tables of
the Incomplete Gamma Function* with $u=v/\sqrt{2n}$, $p=\frac12(n-2)$. Two methods of
obtaining $v_1$ and $v_2$ for a given $\alpha$ were employed in preparing the figures used in
connection with Example V.


Method I. (1) Choose a trial value of $v_1$ (perhaps the equal tail area limit given
by $\alpha_1=\alpha_2=\frac12\alpha$) and by forward interpolation in the Tables determine $\alpha_1$.

(2) Hence obtain $\alpha_2=\alpha-\alpha_1$ and by backward interpolation in the Tables
determine $v_2$.

(3) Insert these values of $v_1$ and $v_2$ in the expression on the left-hand side of
equation (142); this will not vanish as it should had the correct limits been used,
but will equal some quantity $\Delta \neq 0$.

(4) Repeat the steps (1), (2) and (3), starting with further trial values of $v_1$
until values of $\Delta$ both positive and negative have been obtained. A practised
computer is soon able to find, say, four values of $v_1$ at equal intervals leading to
two positive and two negative values of $\Delta$, sufficiently close to zero for backward
interpolation to determine the value of $v_1$ and hence, through steps (1) and (2),
of $v_2$ which make $\Delta=0$.


These will be the limits required in (138). 


Method II. This consists in obtaining, by the procedure described below, pairs
of values $(v_1, v_2)$ satisfying (142), i.e. making $\Delta=0$; then finding from the Gamma
Function Tables the corresponding $\alpha_1$ and $\alpha_2$. The sum $\alpha'=\alpha_1+\alpha_2$ will not now be
precisely equal to $\alpha$, but by a succession of trials similar to those followed using
the first method it is possible to determine the pair $(v_1, v_2)$ leading to $\alpha'=\alpha$. To
calculate values of $v_1$ and $v_2$ satisfying (142), we note that this equation may be
written $v_2=v_1-n\log v_1+n\log v_2$. (143)
Now choose a trial value $v_1=x$ and find a corresponding value $v_2=y_1$, so that
$\alpha_1+\alpha_2=\alpha$ as in (1) and (2) of Method I. Using $y_1$ as a first approximation to the $v_2$
which with $v_1=x$ satisfies (143), find successive approximations $y_2$, $y_3$, ... etc.
as follows:

$y_2=x-n\log x+n\log y_1$,
$y_3=x-n\log x+n\log y_2$,
. . . . . . (144)
$y_i=x-n\log x+n\log y_{i-1}$.

Since the derivative with regard to $v_2$ of the right-hand side of (143) is positive
and less than 1, this process leads very rapidly to a limiting value $v_2=y$, say,

* See footnote to p. 6 above. In the notation of these Tables
$\alpha_1=I\left(\frac{v_1}{\sqrt{2n}},\ \frac{n-2}{2}\right)$, $\alpha_2=1-I\left(\frac{v_2}{\sqrt{2n}},\ \frac{n-2}{2}\right)$.




which with $v_1=x$ satisfies (143). This pair of values when substituted into (139)
and (140) will not, of course, give $\alpha_1$ and $\alpha_2$ satisfying (141), so that further trials
will need to be made to obtain $\alpha_1+\alpha_2=\alpha$.

On the work carried out it is difficult to judge which method of solution is the
quicker. It should be noted, however, that Method I requires an exact backward
interpolation in the Gamma Function Tables at every trial, while in Method II
only a rough backward interpolation is needed when $x$ has been selected to obtain
the first approximation, $y_1$. Since forward is always easier than backward interpolation
in a case where 2nd and 4th differences must be used, Method II is
probably the speedier.
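
With a machine in place of the Tables, the iteration of Method II may be sketched as follows. This is a hedged illustration and not the procedure actually used for the published figures: it assumes that $v$ is distributed as $\chi^2$ with $n$ degrees of freedom, so that the unbiassedness condition reads $v_1^{n/2}e^{-v_1/2}=v_2^{n/2}e^{-v_2/2}$; it restricts $n$ to even values so that the probability integral has a closed form; and it replaces the trial-and-error on $v_1$ by bisection. All function names are hypothetical.

```python
import math

def chi2_cdf_even(v, n):
    """P{chi-square with n degrees of freedom <= v}; closed form valid for even n only."""
    half = v / 2.0
    term, total = 1.0, 1.0
    for j in range(1, n // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total

def upper_limit(v1, n, tol=1e-13, max_iter=10000):
    """Iterate y <- v1 - n log v1 + n log y, as in (144), starting above n."""
    y = 2.0 * n
    for _ in range(max_iter):
        y_next = v1 - n * math.log(v1) + n * math.log(y)
        if abs(y_next - y) < tol:
            return y_next
        y = y_next
    return y

def unbiased_limits(n, alpha, steps=100):
    """Bisect on v1 in (0, n) until alpha_1 + alpha_2 = alpha."""
    lo, hi = 1e-9, float(n)
    for _ in range(steps):
        v1 = 0.5 * (lo + hi)
        v2 = upper_limit(v1, n)
        size = chi2_cdf_even(v1, n) + 1.0 - chi2_cdf_even(v2, n)
        if size > alpha:
            hi = v1   # region too large: move the lower limit down
        else:
            lo = v1
    v1 = 0.5 * (lo + hi)
    return v1, upper_limit(v1, n)
```

A call such as `unbiased_limits(10, 0.05)` returns a pair $(v_1, v_2)$ with $v_1<n<v_2$. The inner iteration converges because the derivative $n/y$ of the right-hand side is less than 1 for $y>n$, which is the remark made in the text about (143).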


AN INVESTIGATION INTO THE APPLICATION OF 
NEYMAN AND PEARSON’S $L_1$ TEST, WITH TABLES
OF PERCENTAGE LIMITS 


By P. P. N. NAYER 


CONTENTS

                                                                        PAGE
1. Scope of the investigation                                             38
2. The adequacy of the Type I approximation to $p(L_1)$ (case $n_t$ constant)  40
3. Methods of approximation                                               40
4. Special case of 2 samples ($k=2$)                                      43
5. Computation of Tables IV and V                                         44
6. The limiting form of $p(L_1)$ when the values of $n_t$ are large       46
7. Case of samples of unequal size ($n_t$ not constant)                   47
8. Illustrative examples                                                  48
9. Summary                                                                50


1. Scope of the investigation. 

The statistical problem considered in the present paper is that in which the
observations of a variable quantity $x$ fall into a number, say $k$, groups. If the
letter $t$ is used to indicate the group and $i$ the particular observation in the group,
then $x_{ti}$ will be the $i$th observation in the $t$th group. The following notation may
be used:

$n_t$ = number of observations in $t$th group.
$N=\sum_{t=1}^{k}(n_t)$ = total number of observations.
$\bar{x}_t=\sum_i(x_{ti})/n_t$ = mean of group.
$s_t^2=\sum_i(x_{ti}-\bar{x}_t)^2/n_t$ = variance of group.

In their paper entitled “On the Problem of k Samples”* J. Neyman and E. S.
Pearson have supposed that the observations in the $t$th sample or group have
been drawn from a normal distribution with mean $a_t$ and standard deviation $\sigma_t$.
They then consider how the data may be used to test certain hypotheses regarding
$a_t$ and $\sigma_t$. In my discussion below I shall be concerned only with the test of the
hypothesis that they have called $H_1$, namely that

$\sigma_1=\sigma_2=\ldots=\sigma_k=\sigma$. (1)
This is the test that these $k$ independent samples have been drawn from populations
having a common standard deviation, $\sigma$, it being assumed that the populations
are normal.

* Bulletin de l’Académie Polonaise des Sciences et des Lettres, Série A (1931), pp. 460-481.

The use of the method of likelihood followed by these authors*
has led them to suggest the following criterion for testing $H_1$,

$L_1=\frac{\prod_{t}(s_t^2)^{n_t/N}}{\frac{1}{N}\sum_{t}(n_t s_t^2)}=\frac{N\prod_{t}\left\{\frac{1}{n_t}\sum_{i}(x_{ti}-\bar{x}_t)^2\right\}^{n_t/N}}{\sum_{t}\sum_{i}(x_{ti}-\bar{x}_t)^2}$, (2)

an expression which is seen to be the ratio of the weighted geometric mean of the
sample variances to their weighted arithmetic mean. As $L_1$ decreases from 1
towards 0, the investigator will be less and less inclined to accept the hypothesis
$H_1$. If $H_1$ is true, then in repeated sampling the $q$th moment coefficient of $L_1$
about zero is

$\mu_q'=N^q\,\frac{\Gamma\left(\frac{N-k}{2}\right)}{\Gamma\left(\frac{N-k}{2}+q\right)}\prod_{t=1}^{k}\left\{\frac{\Gamma\left(\frac{n_t-1}{2}+\frac{qn_t}{N}\right)}{n_t^{\,qn_t/N}\,\Gamma\left(\frac{n_t-1}{2}\right)}\right\}$. (3)
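
The criterion may be computed directly from the grouped observations. The sketch below is an illustration only, assuming the definition of $L_1$ as the ratio of the weighted geometric mean of the sample variances (divisor $n_t$) to their weighted arithmetic mean; the function name `l1_criterion` and the list-of-lists input are hypothetical choices. A group with zero variance makes the logarithm undefined and is not handled.

```python
import math

def l1_criterion(groups):
    """L1 = prod_t (s_t^2)^(n_t/N) / ((1/N) * sum_t n_t s_t^2), with divisor n_t in s_t^2."""
    total_n = sum(len(g) for g in groups)
    log_geometric = 0.0   # log of the weighted geometric mean of the variances
    arithmetic = 0.0      # weighted arithmetic mean of the variances
    for g in groups:
        n_t = len(g)
        mean_t = sum(g) / n_t
        var_t = sum((x - mean_t) ** 2 for x in g) / n_t
        log_geometric += (n_t / total_n) * math.log(var_t)
        arithmetic += n_t * var_t / total_n
    return math.exp(log_geometric) / arithmetic
```

When all the sample variances are equal the function returns 1, and it falls towards 0 as the variances diverge, in agreement with the interpretation given in the text.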


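In the special case where the number of observations in each sample is the same,
i.e. $n_t=\text{constant}=n$, and $N=nk$, it will be seen that $L_1$ may be written

$L_1=\frac{\left\{\prod_{t}(s_t^2)\right\}^{1/k}}{\frac{1}{k}\sum_{t}(s_t^2)}=\frac{\left\{\prod_{t}\sum_{i}(x_{ti}-\bar{x}_t)^2\right\}^{1/k}}{\frac{1}{k}\sum_{t}\sum_{i}(x_{ti}-\bar{x}_t)^2}$ (4)

and

$\mu_q'=k^q\,\frac{\Gamma\left(\frac{N-k}{2}\right)}{\Gamma\left(\frac{N-k}{2}+q\right)}\left\{\frac{\Gamma\left(\frac{n-1}{2}+\frac{q}{k}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\right\}^k$. (5)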


Neyman and Pearson have given reasons for expecting that the sampling
distribution of $L_1$ (if $H_1$ be true) may be represented approximately by a Type I
curve

$p(L_1)=\frac{\Gamma(m_1+m_2)}{\Gamma(m_1)\,\Gamma(m_2)}\,L_1^{m_1-1}(1-L_1)^{m_2-1}$, (6)

where the constants $m_1$ and $m_2$ are chosen so that the first two moment coefficients
of (6) agree with the true values obtained from (3) or (5).

My object in the present paper is (a) to investigate in some detail the adequacy
of this approximation; (b) to provide tables of 5 per cent. and 1 per cent. probability
levels for $L_1$ in the case where $n_t$ is constant (i.e. for the $L_1$ of (4)); (c) to
consider how far in the case where $n_t$ is not constant the probability levels for the
$L_1$ of (2) might be obtained from my tables, entering them with $k$ and $n=N/k$,
the average sample size.


* Loc. cit. and Phil. Trans. Roy. Soc. A, vol. ccxxxi (1933), p. 295.




It should be mentioned that tables of the 5 per cent. and 1 per cent. probability
levels for $L_1$ (and also for Neyman and Pearson’s criterion $L$) have been published
by P. C. Mahalanobis,* but no investigation of the form referred to under (a)
above appears to have been carried out, nor are the argument intervals in these
tables always convenient for interpolation.

It will be seen from the Note by B. L. Welch following this paper, that the tables
of probability levels may be used in more general problems than that considered,
where $L_1$ is based on sums of squares of deviations from regression surfaces.


2. The adequacy of the Type I approximation to $p(L_1)$ (case $n_t=\text{constant}=n$).

Neyman and Pearson’s suggestion that equation (6) might be used to represent
the probability distribution of $L_1$ arose partly from the following considerations:

(a) $L_1$ must lie between 0 and 1.

(b) The likelihood criterion $L_2$ appropriate to test the hypothesis $H_2$ that the
means $a_t$ in the populations sampled are the same, has a sampling distribution†

$p(L_2)=\text{constant} \times L_2^{\frac12(N-k)-1}(1-L_2)^{\frac12(k-1)-1}$, (7)
i.e. exactly of this form.

Until the true form of $p(L_1)$ has been found it is, of course, impossible to give
a final opinion on the adequacy of any method of approximation. The investigation
undertaken has consisted in comparing a number of different approximations
with that of equation (6); there were a priori grounds for believing that, in certain
cases, some of these alternatives would be better approximations than (6).
Since, however, to the order of accuracy aimed at in the tables, no difference was
in fact found between them, there seems increased justification for confidence in
the adequacy of the method adopted.

Another line of attack has been to compare the probability levels obtained
from (6) with those obtained from the true distribution of $L_1$ in the case $k=2$,
where the problem can be completely solved. Again very satisfactory agreement
was found except in the case of very small samples ($n=2$, 3 or 4).


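3. Methods of approximation.
(i) The standard method adopted is to use equation (6), calculating $m_1$ and $m_2$
from the relations
$m_1=\mu_1'(\mu_1'-\mu_2')/(\mu_2'-\mu_1'^2)$, $m_2=(1-\mu_1')(\mu_1'-\mu_2')/(\mu_2'-\mu_1'^2)$, (8)
where $\mu_1'$ and $\mu_2'$ are obtained from (5), i.e.

$\mu_1'=\frac{2k}{N-k}\left\{\frac{\Gamma\left(\frac{n-1}{2}+\frac{1}{k}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\right\}^k$, $\mu_2'=\frac{4k^2}{(N-k)(N-k+2)}\left\{\frac{\Gamma\left(\frac{n-1}{2}+\frac{2}{k}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\right\}^k$. (9)

This may be described as method (i).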
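
Method (i) may be sketched in a few lines. The code below is an illustration only: it assumes the moment formula for samples of common size $n$ in the form $\mu_q'=k^q\,\Gamma(\frac{N-k}{2})/\Gamma(\frac{N-k}{2}+q)\,\{\Gamma(\frac{n-1}{2}+\frac{q}{k})/\Gamma(\frac{n-1}{2})\}^k$ with $N=nk$, and the function names are hypothetical. Logarithms of Gamma functions are used to avoid overflow.

```python
import math

def l1_moment(q, n, k):
    """q-th moment of L1 about zero under H1 for k samples of common size n."""
    big_n = n * k
    log_m = (q * math.log(k)
             + math.lgamma((big_n - k) / 2.0) - math.lgamma((big_n - k) / 2.0 + q)
             + k * (math.lgamma((n - 1) / 2.0 + q / k) - math.lgamma((n - 1) / 2.0)))
    return math.exp(log_m)

def type_one_constants(n, k):
    """Constants m1, m2 of the Type I curve matching the first two moments, as in (8)."""
    mu1, mu2 = l1_moment(1, n, k), l1_moment(2, n, k)
    denom = mu2 - mu1 ** 2
    m1 = mu1 * (mu1 - mu2) / denom
    m2 = (1.0 - mu1) * (mu1 - mu2) / denom
    return m1, m2
```

By construction the fitted curve has mean $m_1/(m_1+m_2)=\mu_1'$ and second moment $\mu_2'$; percentage levels then follow from the incomplete Beta integral of the fitted curve.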


* Sankhyā, the Indian Journal of Statistics, vol. 1, part 1 (1933), pp. 109-122.
† $L_2=1-E^2$, where $E^2$ is the squared correlation ratio in the sample.




(ii) Unless $n$ is small the distribution of $L_1$ is essentially skew with an abrupt
finish at $L_1=1$. It is therefore likely that a better approximation could be
obtained by fitting the Pearson Type I curve in the form

$p(L_1)=\text{constant} \times (L_1-b)^{m_1-1}(1-L_1)^{m_2-1}$, (10)
obtaining $m_1$, $m_2$ and the “start”, $b$, by making the first three moment coefficients
of (10) agree with the correct $\mu_1'$, $\mu_2'$ and $\mu_3'$ obtained from (5).* This may be
termed method (ii).

(iii) The expression $L_1$ is related to Neyman and Pearson’s fundamental likelihood
ratio $\lambda_{H_1}$ in the form $L_1=\lambda_{H_1}^{2/N}$. So far it has been supposed that the best result
will be obtained by representing $L_1$ by a Type I curve; it would, however, be
possible to use another form of the criterion, say, $L_1'=\lambda_{H_1}^{2/k}$. The sampling moment
coefficients of $L_1'$ may be obtained from those of $L_1$ by writing in equation (5)
$nq$ for $q$, with the following result:†

$\mu_q'(L_1')=k^{nq}\,\frac{\Gamma\left(\frac{N-k}{2}\right)}{\Gamma\left(\frac{N-k}{2}+nq\right)}\left\{\frac{\Gamma\left(\frac{n-1}{2}+\frac{nq}{k}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\right\}^k$. (11)

A curve $p(L_1')=\text{constant} \times (L_1')^{m_1-1}(1-L_1')^{m_2-1}$ (12)

may now be fitted to the distribution of $L_1'$ precisely as it was fitted to that of $L_1$
in method (i), obtaining $\mu_1'$ and $\mu_2'$ from (11) instead of (5). This may be termed
method (iii).

(iv) For the cases considered, the sampling distributions of $L_1'$ were more
nearly symmetrical than those for $L_1$, the terminals at 0 and 1 being well removed
from the mean. For example, figures such as those in Table I may be obtained
from equations (5) and (11).

TABLE I. Mean and standard deviations for $L_1$ and $L_1'$.

Standard deviation ($L_1$)    Standard deviation ($L_1'$)
.0433                         .1641
.0152                         .0950


It therefore seemed likely that a better approximation to $p(L_1')$ might be
obtained if a Pearson Type I curve were fitted with the correct first four moment

* The necessary equations for solution have been given in many places. For example see
W. P. Elderton, Frequency Curves and Correlation (1927), p. 121.

† This is the process analogous to that by which Neyman and Pearson obtain the moment
coefficients of $L_1$ from those of $\lambda_{H_1}$. See “On the problem of k samples”, loc. cit. p. 476.


coefficients, rather than the correct terminals and first two moment coefficients.
Such a curve takes the form*
$p(L_1')=\text{constant} \times (L_1'-a_1)^{m_1-1}(a_2-L_1')^{m_2-1}$. (13)

This is method (iv).

These methods of approximation were applied in two cases:

(1) $n=20$, $k=30$; (2) $n=10$, $k=10$.

The equations of the eight curves were obtained, ordinates calculated at equal
intervals and the position of the lower 5 per cent. and 1 per cent. probability levels
obtained by quadrature and backward interpolation. The levels obtained for
$L_1'$, using methods (iii) and (iv), were then expressed as levels for $L_1$, using the
relation $L_1=(L_1')^{1/n}$. The results are compared in Table II.


TABLE II. Comparison of probability levels for $L_1$
obtained by various methods of approximation.

                          Method of fitting
Case    Level   (i)           (ii)          (iii)          (iv)
                (2 moment,    (3 moment,    (2 moment,     (4 moment,
                to $L_1$)     to $L_1$)     to $L_1'$)     to $L_1'$)
n=20    5%      .9268         .9268         .9266          .9268
k=30    1%      .9152         .9152         .9146          .9158
n=10    5%                    .8220         .8209          .8219
k=10    1%                    .7786         .7743          .7792

The most accurate results might be expected to follow from methods (ii) and
(iv); it will be seen that corresponding limits for $L_1$ do not differ by more than
.0006 in these two cases. Further, the largest difference between the $L_1$ values of
methods (i) and (ii) is .0004, and between those of methods (i) and (iv) is .0007.
Method (iii) differs more noticeably from the others, but since the curve is more
nearly symmetrical with long drawn-out tails, we might expect the 2-moment
fixed-terminal method to be less satisfactory than the 4-moment method, (iv).

Since the tables of probability levels at the end of this paper have been given to 
three decimal place accuracy only, the differences between methods (i), (iii) and 
(iv) are unimportant, and there seemed good reasons for using the simplest 
method, namely method (i), as suggested originally by Neyman and Pearson. 

The diagram shows a comparison of the relevant portions of curves obtained
by methods (i) and (iii), and of the ordinates at the 5 per cent. and 1 per cent.
levels. It would have been impossible to draw, as distinct from method (i), the
curves of methods (ii) and (iv).


* The method of fitting is that ordinarily employed, see for example W. P. Elderton, loc. cit.
p. 54.


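[Figure: Probability curves for $L_1$. Case (1) n = 20, k = 30; Case (2) n = 10, k = 10.
Two-moment curves fitted to $L_1$ and to $L_1'$, with the areas representing the 5 per cent.
and 1 per cent. levels; horizontal scale of $L_1$.]


4. Special case of 2 samples (k = 2).

In this case the exact distribution of $L_1$ can be obtained if we notice that

$$L_1 = \frac{(s_1^2 s_2^2)^{1/2}}{\tfrac{1}{2}(s_1^2 + s_2^2)} = 2\{u(1-u)\}^{1/2}, \qquad (14)$$

where $u = s_1^2/(s_1^2 + s_2^2)$. It is known that*

$$p(u) = \frac{\Gamma(n-1)}{\{\Gamma(\tfrac{1}{2}(n-1))\}^2}\,u^{\frac{1}{2}(n-3)}(1-u)^{\frac{1}{2}(n-3)}. \qquad (15)$$

Making the necessary transformation, using (14) and remembering that pairs
of values, $u = u_0$ and $u = 1 - u_0$, give identical values to $L_1^2$, we find that

$$p(L_1^2) = \frac{\Gamma(n-1)}{2^{n-2}\{\Gamma(\tfrac{1}{2}(n-1))\}^2}\,(L_1^2)^{\frac{1}{2}(n-3)}(1-L_1^2)^{-\frac{1}{2}}, \qquad (16)$$

or

$$p(L_1) = \frac{\Gamma(n-1)}{2^{n-3}\{\Gamma(\tfrac{1}{2}(n-1))\}^2}\,L_1^{\,n-2}(1-L_1^2)^{-\frac{1}{2}}. \qquad (17)$$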


* See, for example, E. S. Pearson, "Analysis of variance in cases of non-normal variation",
Biometrika, vol. xxiii (1931), pp. 129-130.




It follows that the distribution of $L_1^2$ is a Type I curve whose probability
integral can be obtained from the Tables of the Incomplete Beta Function.*

The method of approximation (i) of the preceding section has been to use the
curve of equation (6), that is, to assume that $L_1$, and not $L_1^2$, follows the Type I
law. Table III contains a comparison between:

(a) True probability levels obtained from (16) by backward interpolation in
the Incomplete Beta Function Tables. These values will be for $L_1^2$, but their
square roots are the true probability levels for $L_1$.

(b) Approximate levels obtained by applying quadrature to ordinates of curves
of equation (6) fitted by method (i).


TABLE III. True and approximate probability levels for $L_1$ (k = 2).

    From true distribution        From approximate distribution
       5%         1%
      .0785      .0157
      .3122      .1411
      .4780      .2843
      .7984      .6783
      .9015      .8360
      .9612      .9339


The agreement clearly becomes closer and closer as the sample size increases. 
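
The true levels for k = 2 can be checked without tables. The sketch below is not the author's procedure (which used backward interpolation in the Incomplete Beta Function Tables); it assumes normal variation, equal sample sizes n, and the substitution $L_1 = \sin\phi$, under which the exact law reduces to a smooth integral that Simpson's rule handles easily. The function names are illustrative only.

```python
import math


def cdf_l1_two_samples(l1, n, steps=2000):
    """P(L1 <= l1) for k = 2 samples of n observations each.

    Assumes u = s1^2/(s1^2 + s2^2) follows the symmetrical Type I law with
    exponents (n-3)/2, so that with L1 = sin(phi) the probability equals
    2*C * integral from 0 to arcsin(l1) of (sin(phi)/2)**(n-2) d(phi),
    where C = Gamma(n-1) / Gamma((n-1)/2)**2.
    """
    c = math.exp(math.lgamma(n - 1) - 2.0 * math.lgamma((n - 1) / 2.0))
    upper = math.asin(l1)
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):  # composite Simpson's rule
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * (0.5 * math.sin(i * h)) ** (n - 2)
    return 2.0 * c * total * h / 3.0


def level_l1_two_samples(prob, n):
    """Lower probability level of L1 found by bisection on the exact law."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cdf_l1_two_samples(mid, n) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for n in (2, 3, 4):
    print(n, round(level_l1_two_samples(0.05, n), 4),
          round(level_l1_two_samples(0.01, n), 4))
```

For n = 2 and n = 3 the integral has a closed form, giving 5 per cent. levels of $\sin(.025\pi) = .0785$ and $(1 - .95^2)^{1/2} = .3122$, which agree with the first two entries of Table III.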

While the limits given in Tables IV and V below for k= 2 have been obtained 
from the true distribution law, the close correspondence between the true and the 
approximate results in this marginal row of each table increases confidence in 
method (i), which has been used throughout the other rows of the table. 


* Edited by Karl Pearson and published from the Biometrika Office, University College, London.
Notice that owing to the relation $\Gamma(\tfrac{1}{2})\,\Gamma(n-1) = 2^{n-2}\,\Gamma\left(\frac{n-1}{2}\right)\Gamma\left(\frac{n}{2}\right)$ the constant in (16) may
be written in the usual form of $1/B\left(\frac{n-1}{2}, \frac{1}{2}\right)$.


5. Computation of Tables IV and V.†

For reasons given in the preceding sections it was decided to adopt method (i).
The process of computation was as follows:

(a) Values of $\mu_1'$ and $\mu_2'$ were obtained from (9) using 10-figure tables of the
logarithms of the Gamma Function‡ and Vega's 10-figure logarithm tables.
A large number of places of decimals were required, since $\mu_2' - (\mu_1')^2 = \sigma_{L_1}^2$ was
often a very small quantity.

(b) These values of $\mu_1'$ and $\mu_2'$ were inserted into (8) to give the constants $m_1$
and $m_2$ of (6).

(c) Using these values of $m_1$ and $m_2$ it was in nearly all cases possible to find,
by interpolation, 5 per cent. and 1 per cent. levels for $L_1$ from tables of 5 per cent.
and 1 per cent. points of R. A. Fisher's z,* using the transformation

$$L_1 = n_2/(n_2 + n_1 e^{2z}), \qquad (18)$$

which connects the Type I with the z-distribution. Here $n_1$ and $n_2$ are the number
of degrees of freedom with which the z-tables are entered, and we must set $n_1 = 2m_2$,
$n_2 = 2m_1$. The values of $n_1$ and $n_2$ were not, of course, integers, so that interpolation
in the z-tables was necessary. For this purpose I used partly R. A.
Fisher's tables and partly an extension of these tables available in the Department
of Statistics.

(d) This procedure was carried out for a framework of points, viz.
n = 3, 4, 5, 7, 10, 12, 15, 20, 30, 60,
k = 3, 4, 5, 7, 10, 15, 20, 30.
Intermediate values were filled in using 4-point Lagrangian interpolation formulae.

(e) The results which, in the first place, were calculated to four decimal places,
were finally cut down to three places. Since they have been based on an approximation
to the unknown true distribution, $p(L_1)$, it was felt that the use of more
than three figures would not be warranted. For practical purposes, however, this
would appear to be quite sufficient.


† Tables IV and V have been placed at the end of the paper.
‡ Tracts for Computers, No. VIII.
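
Steps (a) to (c) can be imitated with modern arithmetic. The following is a minimal sketch, not the author's computation: it assumes that for k samples of n the moments of $L_1$ about zero are $\mu_q' = k^q\,[\Gamma(\tfrac{1}{2}f + q/k)/\Gamma(\tfrac{1}{2}f)]^k\,\Gamma(\tfrac{1}{2}kf)/\Gamma(\tfrac{1}{2}kf + q)$ with f = n - 1 (equations (5) and (9) are not reproduced in this part of the paper), it fits the Type I curve by its first two moments, and it replaces the z-table step by direct quadrature of the fitted curve. All function names are hypothetical.

```python
import math


def log_moment_l1(q, n, k):
    """log of the assumed q-th moment of L1 about zero (k samples of n)."""
    f = n - 1
    return (q * math.log(k)
            + k * (math.lgamma(f / 2 + q / k) - math.lgamma(f / 2))
            + math.lgamma(k * f / 2) - math.lgamma(k * f / 2 + q))


def fit_type_one(n, k):
    """Exponents m1, m2 of the curve L1**(m1-1) * (1-L1)**(m2-1), step (b)."""
    mean = math.exp(log_moment_l1(1, n, k))
    variance = math.exp(log_moment_l1(2, n, k)) - mean ** 2
    common = mean * (1.0 - mean) / variance - 1.0
    return mean * common, (1.0 - mean) * common


def type_one_area(x, m1, m2, steps=2000):
    """Area of the fitted curve from 0 to x (Simpson's rule; needs m1 > 1)."""
    log_const = math.lgamma(m1 + m2) - math.lgamma(m1) - math.lgamma(m2)

    def ordinate(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_const + (m1 - 1) * math.log(t)
                        + (m2 - 1) * math.log(1.0 - t))

    h = x / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * ordinate(i * h)
    return total * h / 3.0


def level_l1(prob, n, k):
    """Lower probability level of L1 from the two-moment fit, by bisection."""
    m1, m2 = fit_type_one(n, k)
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if type_one_area(mid, m1, m2) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(round(level_l1(0.05, 20, 30), 4), round(level_l1(0.01, 20, 30), 4))
```

Under these assumptions the case n = 20, k = 30 should land close to the method (i) entries of Table II (.9268 and .9152). Working with logarithms of the Gamma function throughout mirrors the remark in (a) that the variance is a small difference of nearly equal quantities.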


It will be seen that the tables give limits for all values of n and k between 3
and 10. Above 10, the tabling for n is in harmonic intervals, as in R. A. Fisher's
z-tables, so that if x = 60/n = 60/(f + 1), we have the correspondence

n = 10, 12, 15, 20, 30, 60, ∞,
x = 6, 5, 4, 3, 2, 1, 0.

Linear interpolation for n in any row of constant k is now possible, using x as
the argument.

The columns are also headed with the degrees of freedom, f, on which each sum
of squares depends. In the application considered here f = n − 1, but as Mr Welch
points out in the following Note the tables may be used in other problems where
f has a different value.

For k, it was found that nothing was gained by using a harmonic interval;
besides, the changes with k in any column of constant n are very gradual.
Probability levels for the case of samples of 2 were calculated as follows: 


TABLE VI. Levels for $L_1$ in the case n = 2.


* Statistical Methods for Research Workers, Table VI. 




It is possible that problems might occur in which the $L_1$ test could be applied
usefully to samples of 2. If, for example, duplicate observations were taken as a
routine process in chemical analysis, the test might be used to examine whether
the error of analysis was remaining stable. On examination of some data in this
form it was realized, however, that since the modal value of s in samples of 2 is
zero, it may often happen in practice, particularly if the scale of measurement is
not very fine, that the two observational readings will coincide and s be zero.
A single value of s = 0 will make $L_1 = 0$. I am not therefore clear as to the utility of
the test in this case, and have not entered these figures in the main tables which
commence with the case n = 3.

Reference has been made above to P. C. Mahalanobis's tables for $L_1$. On comparison
it is found that a number of his figures differ somewhat from mine; it
seems possible that this may be due to the fact that, as stated in his paper (loc. cit.
p. 111), only 7-figure logarithms of the Gamma Function (instead of 10-figure)
were used. In certain cases it is then only possible to calculate $\mu_2' - (\mu_1')^2$
accurately to two or three significant figures. It follows that the final values for
$L_1$ may suffer in accuracy.


6. The limiting form of $p(L_1)$ when the values of $n_t$ are large.

Neyman and Pearson have pointed out* that when the number of observations
$n_t$ in each sample is large, the sampling distribution of $L_1$ may be obtained
approximately from that of $\chi^2$ as follows: if

$$\chi^2 = -N \log_e L_1, \qquad (19)$$

then

$$p(\chi^2) = \frac{1}{2^{\frac{1}{2}(k-1)}\,\Gamma\{\tfrac{1}{2}(k-1)\}}\,(\chi^2)^{\frac{1}{2}(k-3)}\,e^{-\frac{1}{2}\chi^2}, \qquad (20)$$

which is seen to be the usual $\chi^2$ distribution associated with f = k − 1 degrees of
freedom. It follows that probability levels for $L_1$ may be obtained by inserting the
corresponding levels for $\chi^2$ in equation (19). From the form of (19) it is seen that
for large samples the distribution of $L_1$ depends only on $N = \Sigma(n_t)$ and k, and not
on the individual values of $n_t$. This result will be referred to again in the next
section; I have to thank Mr P. V. Sukhatme for drawing my attention to its
practical importance.†

In order to investigate the rapidity with which the limiting form is reached in
the case where the sample $n_t$'s are equal, I have compared in Table VII below, for
n = 30 and 60, the 5 per cent. and 1 per cent. levels for $L_1$,

(a) given in my Tables IV and V,
(b) calculated from the $\chi^2$ approximation of (19) and (20).


* "On the problem of k samples", loc. cit. p. 473.
† See his paper entitled "A contribution to the problem of two samples", Proc. Indian Ac.
Sci. vol. , no. 6 (1935), p. 595.


An examination of the figures suggests that the $\chi^2$ approximation may be used
with safety for n > 60. This is a satisfactory result, since as n → ∞, all values of $L_1$
will lie close to unity, and the 3-figure accuracy of my tables will not be sufficient.
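
The large-sample rule is easy to apply numerically. The sketch below assumes the limiting relation has the form $\chi^2 = -N \log_e L_1$ with k − 1 degrees of freedom, so that a lower level of $L_1$ corresponds to an upper level of $\chi^2$; the $\chi^2$ point is found from the ordinary series for the incomplete gamma function rather than from printed tables. Function names are illustrative.

```python
import math


def chi_square_cdf(x, df):
    """P(chi-square <= x), from the series for the incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for i in range(1, 2000):
        term *= t / (a + i)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(a * math.log(t) - t - math.lgamma(a))


def chi_square_upper_point(prob, df):
    """Value exceeded with probability prob, found by bisection."""
    lo, hi = 0.0, 10.0 * df + 50.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 1.0 - chi_square_cdf(mid, df) > prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def l1_level_large_samples(prob, total_n, k):
    """Approximate lower level of L1 for k samples with N = total_n in all."""
    return math.exp(-chi_square_upper_point(prob, k - 1) / total_n)


# k = 10 samples of n = 60 each, so N = 600
print(round(l1_level_large_samples(0.05, 600, 10), 4))
```

Because only N and k enter, the same call serves for unequal large samples with the same total.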


TABLE VII. Probability levels for $L_1$; $\chi^2$ approximation.


7. Case of samples of unequal size ($n_t$ not constant).

As pointed out in the preceding section, as $n_t \to \infty$ (t = 1, 2, ... k) the sampling
distribution of $L_1$ may be approximated to through a $\chi^2$ transformation; under
these conditions the distribution depends only on the values of k and N, or
expressed in another way on the values of k and $\bar{n} = N/k$, the mean sample size.
If the values of $n_t$ are unequal and some or all are small, the true moment coefficients
of $L_1$ are given by (3), and probability levels can be obtained by the method
of approximation (i) described in Section 4. It would clearly be impracticable,
however, to compute tables which could be entered with all possible combinations
of k and $n_t$. The following question, however, seemed worth considering; when
samples are of unequal size and not necessarily all very large, could approximate
probability levels for $L_1$ be obtained by entering Tables IV and V with k and
$\bar{n} = \Sigma(n_t)/k = N/k$? I have not been able to test this point in a systematic manner,
but in five different cases have made a comparison of

(a) 5 per cent. and 1 per cent. limits for $L_1$ obtained by entering Tables IV and
V with k and $\bar{n}$;

(b) the corresponding levels obtained from using the true $\mu_1'$ and $\mu_2'$ for
unequal $n_t$'s (equation (3)), and then the method of approximation (i).

Case (1).
$n_1 = 13$, $n_2 = 21$, $n_3 = n_4 = 23$.
N = 80, k = 4, $\bar{n} = 20$.
                5%      1%
Limits (a)     .900    .859
Limits (b)     .900    .858

Case (2).
$n_1 = 5$, $n_2 = 10$, $n_3 = 15$.
N = 30, k = 3, $\bar{n} = 10$.
                5%      1%
Limits (a)     .792    .699
Limits (b)     .778    .679

Case (3).
$n_1$ to $n_{15} = 20$, $n_{16}$ to $n_{25} = 30$, $n_{26}$ to $n_{30} = 60$.
N = 900, k = 30, $\bar{n} = 30$.

Case (4).
$n_1$ to $n_{16} = 20$, $n_{17}$ to $n_{26} = 30$, $n_{27}$ to $n_{29} = 60$, $n_{30} = 100$.
N = 900, k = 30, $\bar{n} = 30$.

Case (5).
$n_1$ to $n_5 = 4$, $n_6$ to $n_{12} = 10$, $n_{13}$ to $n_{15} = 20$, $n_{16}$ to $n_{20} = 30$, $n_{21}$ to $n_{30} = 60$.
N = 900, k = 30, $\bar{n} = 30$.


For all the last three cases k = 30, $\bar{n} = 30$, and the comparison is shown in Table
VIII.


TABLE VIII. 
| Limits (a) 
Case (3) H 
| Limite (6) {ee (4) | 951 
Case (5) -972 


The approximation fails badly in cases (2) and (5), but it is otherwise very satisfactory.
It is clear that further investigation is needed, but as far as these results
go they suggest that if none of the values of $n_t$ are less than 20, the tables may be
entered with k and $\bar{n}$.
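
A simulation gives an independent view of a case such as (2). This is not part of the paper's method: it is a Monte Carlo sketch which assumes normal variation, takes the criterion for unequal samples to be $L_1 = N \prod_t (\theta_t/n_t)^{n_t/N} / \Sigma(\theta_t)$ with each sum of squares $\theta_t$ distributed as $\chi^2$ on $n_t - 1$ degrees of freedom, and reads the levels off the sorted simulated values. The seed and replicate count are arbitrary choices.

```python
import math
import random


def simulate_l1(sizes, rng):
    """One simulated value of L1 under the hypothesis of a common variance."""
    total_n = sum(sizes)
    # chi-square on (n - 1) degrees of freedom is a gamma variate, scale 2
    thetas = [rng.gammavariate((n - 1) / 2.0, 2.0) for n in sizes]
    log_l1 = math.log(total_n) - math.log(sum(thetas))
    log_l1 += sum(n / total_n * math.log(theta / n)
                  for n, theta in zip(sizes, thetas))
    return math.exp(log_l1)


def simulated_levels(sizes, replicates=20000, seed=1936):
    """Empirical 5 per cent. and 1 per cent. lower levels of L1."""
    rng = random.Random(seed)
    values = sorted(simulate_l1(sizes, rng) for _ in range(replicates))
    return values[int(0.05 * replicates)], values[int(0.01 * replicates)]


five, one = simulated_levels([5, 10, 15])    # sample sizes of Case (2)
print(round(five, 3), round(one, 3))
```

If the assumptions hold, the simulated levels for the sizes 5, 10, 15 should fall nearer to limits (b) than to limits (a), subject to sampling error of a few units in the third decimal place.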


8. Illustrative examples. 

The computation of $L_1$, in the case where $n_t$ is constant, is extremely simple,
involving only the addition of sums of squares and of the logarithms of these sums
of squares (see equation (4)). The following examples may be helpful:


Example (1). Tests of tensile strength were made on certain Die Casting Alloys,* 
the unit being 1000 Ib. per sq. in. Five specimens of each of 25 different alloys 
were tested, and it was desired to investigate whether the variation between 
specimens of the same alloy was sensibly the same for all alloys, i.e. to test the 
hypothesis that the 25 samples of 5 had been drawn from populations having a
common variance, $\sigma^2$, though possibly very different mean strengths.


* The data were placed at my disposal through the kindness of Dr W. A. Shewhart of the Bell 
Telephone Laboratories, New York. 


Table IX shows the 25 values of $\sum_i (x_{ti} - \bar{x}_t)^2$ and $\log \sum_i (x_{ti} - \bar{x}_t)^2$, followed
by the computation required to determine $L_1$. For the case k = 25, n = 5, Table IV
gives the 5 per cent. level of significance for $L_1$ as .674. The observed value of .690
is within this limit* and consequently there is no evidence for rejecting the
hypothesis of a common $\sigma^2$. The best estimate to make of the value of $\sigma^2$ is

$$\frac{1}{k(n-1)} \sum_t \sum_i (x_{ti} - \bar{x}_t)^2 = 2.982.$$

It is assumed, of course, that the variation in tensile strength is approximately
normal.

TABLE IX. Variation in Aluminium Die Castings.

Sample no.    s_t^2      Σ(x_ti − x̄_t)²    log Σ(x_ti − x̄_t)²
     1        1.0121         5.060             .7042
     2        1.4120         7.060             .8488
     3         .9873         4.938             .6936
     4         .1816          .908            1̄.9581
     5        1.5341         7.670             .8848
     6         .9336         4.668             .6691
     7        3.1482        15.741            1.1970
     8        2.8627        14.313            1.1557
     9        2.7254        13.627            1.1344
    10        2.1224        10.612            1.0258
    11        1.3545         6.772             .8307
    12        1.6074         8.037             .9051
    13        1.7096         8.548             .9319
    14        1.9635         9.817             .9920
    15        2.8569        14.284            1.1549
    16        7.7772        38.886            1.5898
    17        3.0724        15.362            1.1864
    18        5.2144        26.072            1.4162
    19        9.7438        48.719            1.6877
    20         .5543         2.771             .4427
    21         .4976         2.488             .3959
    22        1.5731         7.865             .8957
    23        2.1349        10.674            1.0284
    24        2.4228        12.114            1.0833
    25         .2374         1.187             .0745

Sum                        298.193           22.8867
(1/25) × Sum                11.928             .91547
log (11.928)                                  1.07657
Difference = log L_1                          1̄.83890

$L_1 = .6901$.
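
The arithmetic of Table IX can be repeated mechanically. The sketch below takes equation (4) to mean that, for equal samples, $L_1$ is the geometric mean of the k sums of squares divided by their arithmetic mean, which is what the table's working amounts to; the data are the 25 sums of squares of Table IX and the names are illustrative.

```python
import math

# Sums of squares about the sample means for the 25 alloys (Table IX)
SUMS_OF_SQUARES = [
    5.060, 7.060, 4.938, 0.908, 7.670, 4.668, 15.741, 14.313, 13.627,
    10.612, 6.772, 8.037, 8.548, 9.817, 14.284, 38.886, 15.362, 26.072,
    48.719, 2.771, 2.488, 7.865, 10.674, 12.114, 1.187,
]
SAMPLE_SIZE = 5


def l1_equal_samples(sums_of_squares):
    """Geometric mean over arithmetic mean, worked through common logarithms."""
    k = len(sums_of_squares)
    mean_of_logs = sum(math.log10(s) for s in sums_of_squares) / k
    log_of_mean = math.log10(sum(sums_of_squares) / k)
    return 10.0 ** (mean_of_logs - log_of_mean)


l1 = l1_equal_samples(SUMS_OF_SQUARES)
pooled_variance = sum(SUMS_OF_SQUARES) / (len(SUMS_OF_SQUARES) * (SAMPLE_SIZE - 1))

print(round(l1, 4))               # compare with L1 = .6901 in Table IX
print(round(pooled_variance, 3))  # compare with the estimate 2.982
print(l1 > 0.674)                 # True means no evidence against a common variance
```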


* It must be remembered that low values of $L_1$ suggest significance, not high values as in the
case of many other criteria such as z or t.


Example (2). As B. L. Welch has pointed out in his Note which follows, the $L_1$
test may be used to examine whether there are significant differences in the
variance about regression lines fitted to a number of samples. The present data
are taken from Table V of a paper by G. W. Snedecor† dealing with the use of the
analysis of covariance in connection with certain grading problems among
mathematics classes at Iowa State College. The students were taught in fourteen
different classes not all of equal size, and for each individual there was recorded
a mathematics grade, say y, and a general ability score, say x. Snedecor was
primarily concerned with possible differences from class to class in the regression
coefficients $b_t$ of y on x. Since, however, the theory of the tests which he applied is
based on the hypothesis that the variance of y about the regression lines is the
same for all the fourteen classes, it is of interest to put this hypothesis to the test.

In my notation k = 14; the $n_t$ are not equal, but as seen from Table X vary only
between 17 and 21. I shall assume therefore that the average $n_t$ or 19.4 may be
used in entering the tables. As Mr Welch has shown in this case where the deviations
are taken from a fitted regression straight line, $L_1$ may be calculated as in
equation (2) if $\sum_i \{y_{ti} - \bar{y}_t - b_t(x_{ti} - \bar{x}_t)\}^2 = \Sigma(y' - b_t x')^2$, say, is substituted for
$\sum_i (x_{ti} - \bar{x}_t)^2$, the sampling distribution of $L_1$ being as before except that $n_t - 1$ takes
the place of $n_t$. It follows that the $L_1$ tables will be entered with k = 14 and
$f = \bar{n} - 2 = (271/14) - 2 = 19.4 - 2 = 17.4$.

† Proceedings of the American Statistical Association, vol. xxx (1935), p. 266.

TABLE X. Variation in mathematical grading.

Class    Σ(y′ − b_t x′)²    log Σ(y′ − b_t x′)²
   1         15.76              1.1976
   2          9.86               .9939
   3         18.83              1.2749
   4          9.68               .9859
   5         12.76              1.1059
   6         18.29              1.2622
   7         10.54              1.0228
   8          8.40               .9243
   9         14.55              1.1629
  10          8.66               .9375
  11         10.73              1.0306
  12         19.34              1.2865
  13         21.44              1.3312
  14         14.71              1.1676

Sum         193.55             15.6838


I must thank Mr Welch for working out this example for me. $L_1$ should, of
course, be calculated as in equation (2), weighting the logarithms of the sums of
squares with $n_t$; if this is done it is found that $L_1 = .956$. If, however, we use
equation (4), which may be regarded as an approximation to (2) obtained by
assuming that $n_t$ is constant, it is found, on following the simple procedure shown
in Table IX, that $L_1 = .954$.

Turning to Table IV we may find the 5 per cent. level for $L_1$ both for k = 14 and
f = 17, and for k = 14 and f = 18, or we may interpolate for f = 17.4, remembering
that we are dealing with an approximation. In the latter event we take as argument
x = 60/(f + 1) = 60/18.4 = 3.26 and find for the limit

.918 + .26 × (.890 − .918) = .911.

The observed $L_1$ is within this limit, and there is clearly no evidence of differences
in the residual variation among the fourteen college classes.
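
The interpolation just carried out uses the harmonic argument x = 60/(f + 1). A small sketch of the rule follows; it assumes, as the arithmetic above implies, that .918 and .890 are the tabled 5 per cent. limits for k = 14 at x = 3 (f = 19) and x = 4 (f = 14). The function name is hypothetical.

```python
def interpolate_limit(f, f_low, limit_low, f_high, limit_high):
    """Linear interpolation between two tabled limits, using x = 60/(f + 1)."""
    x = 60.0 / (f + 1.0)
    x_low = 60.0 / (f_low + 1.0)
    x_high = 60.0 / (f_high + 1.0)
    fraction = (x - x_low) / (x_high - x_low)
    return limit_low + fraction * (limit_high - limit_low)


# k = 14: limit .918 at f = 19 (x = 3) and .890 at f = 14 (x = 4)
limit = interpolate_limit(17.4, 19, 0.918, 14, 0.890)
print(round(limit, 3))   # .911, as in the text
print(0.954 > limit)     # observed L1 is above the limit, so not significant
```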


9. Summary.

The main object of this paper has been to provide tables of 5 per cent. and 1 per
cent. probability levels for the $L_1$ criterion when the k samples contain an equal
number of observations. Investigations have been carried out into the adequacy
of the method of approximation used; this has been compared with the true
distribution for the case of two samples and also with the limiting form when the
samples are large.

It has been suggested that when the number of observations, $n_t$, are not the
same in each sample, it will probably be justifiable to enter the tables with k and
the average $n_t$, or $\bar{n}$, provided no $n_t$ is less than 15 or 20.

Two illustrative examples have been discussed.

I am indebted to Professor E. S. Pearson for much advice during the course of
my work and for assistance in arranging the paper for publication.


TABLE IV. 5 per cent. limits for $L_1$.

  f |   2  |   3  |   4  |   5  |   6  |   7  |   8  |   9
  n |   3  |   4  |   5  |   6  |   7  |   8  |   9  |  10
 k
  2 | .312 | .478 | .585 | .656 | .708 | .745 | .775 | .798
  3 | .304 | .470 | .576 | .648 | .700 | .739 | .769 | .792
  4 | .315 | .480 | .585 | .656 | .707 | .744 | .774 | .797
  5 | .328 | .491 | .595 | .665 | .714 | .751 | .780 | .802
  6 | .339 | .502 | .604 | .673 | .721 | .757 | .785 | .808
  7 | .350 | .512 | .612 | .680 | .727 | .763 | .790 | .812
  8 | .359 | .520 | .620 | .686 | .733 | .768 | .795 | .816
  9 | .367 | .527 | .626 | .691 | .738 | .772 | .798 | .819
 10 | .374 | .534 | .631 | .696 | .742 | .776 | .802 | .822
 12 | .387 | .545 | .641 | .704 | .749 | .782 | .807 | .828
 14 | .397 | .554 | .649 | .711 | .755 | .787 | .812 | .832
 16 | .405 | .561 | .655 | .716 | .759 | .791 | .816 | .835
 18 | .412 | .567 | .660 | .721 | .763 | .795 | .819 | .838
 20 | .418 | .573 | .665 | .725 | .767 | .798 | .822 | .840
 22 | .424 | .577 | .669 | .728 | .770 | .800 | .824 | .843
 24 | .428 | .581 | .672 | .731 | .772 | .802 | .826 | .844
 26 | .433 | .585 | .675 | .734 | .775 | .805 | .828 | .846
 28 | .437 | .589 | .678 | .736 | .777 | .807 | .829 | .848
 30 | .441 | .592 | .681 | .739 | .779 | .809 | .831 | .849

[Columns for f = 11, 14, 19, 29, 59, ∞ (n = 12, 15, 20, 30, 60, ∞): entries illegible
apart from the fragment .730, .783.]


TABLE V. 1 per cent. limits for $L_1$.

  n |   3  |   4  |   5  |   6  |   7  |   8  |   9  |  10
    … | .284 | .398 | .485 | .551 | …
NOTE ON AN EXTENSION OF THE $L_1$ TEST
By B. L. WELCH


In a recent paper* dealing with problems of regression I showed how a criterion
of the $L_1$ form, similar to that treated by Mr P. P. N. Nayer, arose when testing
the hypothesis that the variance about the regression straight line of y upon x
was the same in a number of populations. Let us now consider the case where
there is more than one independent variable.

Suppose that in k different populations of individuals we measure l + 1 different
characters, say

$$y,\ x^{(1)},\ x^{(2)},\ \ldots\ x^{(l)}, \qquad (1)$$

and suppose that within the tth population (t = 1, 2, ... k) the variate y for any
fixed system of values of the x's is normally distributed about a mean

$$\bar{y}_t(x^{(1)} \ldots x^{(l)}) = \beta_{0t} + \beta_{1t}x^{(1)} + \ldots + \beta_{lt}x^{(l)}, \qquad (2)$$

with standard deviation $\sigma_t$. The equation (2) represents of course the regression
equation of y on the x's, and it is only assumed that it is linear for each of the
k populations considered; the values of the coefficients $\beta$ are not known and may
vary from population to population. It is assumed that the standard deviations
of the y's do not depend on the values of the x's and that they may vary from
population to population. Suppose now that a sample of $n_t$ individuals (t = 1, 2, ... k)
is drawn from each of the k populations and denote by

$$y_{ti},\ x^{(1)}_{ti},\ \ldots\ x^{(l)}_{ti} \qquad (3)$$

the values of the variables (1) corresponding to the ith individual drawn from the
tth population. The joint elementary probability law of the y's corresponding to
any fixed system of values of the x's is then

$$p(y_{11} \ldots y_{kn_k} \mid \beta_{01}, \ldots \beta_{lk},\ \sigma_1, \ldots \sigma_k,\ x^{(1)}_{11}, \ldots x^{(l)}_{kn_k}) = \left(\frac{1}{\sqrt{2\pi}}\right)^{N} \prod_t \left(\frac{1}{\sigma_t}\right)^{n_t} e^{-\frac{1}{2}\sum_t (\theta_t/\sigma_t^2)}, \qquad (4)$$

where $N = \Sigma(n_t)$ and

$$\theta_t = \sum_{i=1}^{n_t} (y_{ti} - \beta_{0t} - \beta_{1t}x^{(1)}_{ti} - \ldots - \beta_{lt}x^{(l)}_{ti})^2. \qquad (5)$$

The sum $\sum_t$ and the product $\prod_t$ must be taken throughout as over the values
t = 1, 2, ... k unless it is stated to the contrary.
* "Some problems in the analysis of regression among k samples of two variables", Biometrika,
vol. xxvii (1935), p. 145.


Let the hypothesis that we are interested in be that the values of $\sigma_t$ (t = 1, 2, ... k)
are all equal. Using the Neyman-Pearson approach to the problem we define two
classes of hypotheses; first, the set $\Omega$ of all admissible hypotheses for which $\sigma_t$,
$\beta_{0t}, \ldots \beta_{lt}$ are unrestricted, and second, the set $\omega$ for which $\sigma_t$ is restricted to be the
same for all t. Then if $p(\Omega\ \text{max})$ is the maximum value of p with respect to the
parameters at our disposal in $\Omega$ and $p(\omega\ \text{max})$ is similarly defined for $\omega$, the test
criterion, $L_1$, is given by

$$L_1 = \lambda^{2/N} = \left\{\frac{p(\omega\ \text{max})}{p(\Omega\ \text{max})}\right\}^{2/N}. \qquad (6)$$

For $p(\Omega\ \text{max})$ we have $\sigma_t^2 = \theta_t'/n_t$, where $\theta_t'$ is the value obtained by minimizing
$\theta_t$ with respect to the $\beta$'s. For $p(\omega\ \text{max})$ we have $\sigma^2 = \sum_t (\theta_t')/N$. We obtain therefore

$$L_1 = \frac{N}{\sum_t \theta_t'} \prod_t \left(\frac{\theta_t'}{n_t}\right)^{n_t/N}. \qquad (7)$$

Particular cases of the application of this criterion are (a) Neyman and
Pearson's original case where only a mean is fitted in each group, i.e. l = 0 and
$\theta_t' = \sum_{i=1}^{n_t} (y_{ti} - \bar{y}_t)^2$,* and (b) the case of my paper, where the variances about the
regression line of y upon x are considered, i.e. l = 1 and
$\theta_t' = \sum_{i=1}^{n_t} \{(y_{ti} - \bar{y}_t) - b_t(x_{ti} - \bar{x}_t)\}^2$.

We require the distribution of $L_1$ when the hypothesis to be tested is true, i.e.
when $\sigma_t^2 = \sigma^2$ (t = 1, 2 ... k). Now $\theta_t'$ is the sum of the squares of the residuals about
the best fitting regression plane to the tth sample, and is therefore distributed as
$\chi_t^2\sigma^2$, independently of the fixed values of the x's. $\chi_t^2$ has $n_t - l - 1$ degrees of
freedom. Thus when $\sigma_t^2$ is the same for all t, $L_1$ takes the form

$$L_1 = \frac{N}{\sum_t \chi_t^2} \prod_t \left(\frac{\chi_t^2}{n_t}\right)^{n_t/N}, \qquad (8)$$

where the $\chi_t^2$ are all independent. The distribution of $L_1$ is not known but its
moments may be calculated. They are in fact obtained from Nayer's equation (3)
(p. 39 above) if $n_t - l$ be substituted for $n_t$.

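* "On the problem of k samples", Bulletin de l'Académie Polonaise des Sciences et des Lettres,
Série A (1931), pp. 460-481.


As it is possible that in other problems a criterion may be used for which the
weights associated with the terms in the product are not $n_t/N$, it is perhaps of
interest to consider a quantity L given by

$$L = \frac{\Sigma(w_t)}{\sum_t (\chi_t^2)} \prod_t \left(\frac{\chi_t^2}{w_t}\right)^{w_t/\Sigma(w_t)}, \qquad (9)$$

where $w_t$ are any weights, and the $\chi_t^2$ (t = 1, 2, ... k) are distributed independently
as $\chi^2$ with $f_t$ degrees of freedom respectively. Starting with the joint distribution
of the $\chi_t^2$, introducing as new variables $\phi_0 = \sum_t (\chi_t^2)$ and $\phi_t = \chi_t^2 / \sum_t (\chi_t^2)$ for
t = 1, 2, ... k − 1, and finally integrating out with respect to $\phi_0$ from zero to infinity
we get the following joint distribution for $\phi_1, \phi_2, \ldots \phi_{k-1}$:

$$p(\phi_1, \phi_2, \ldots \phi_{k-1}) = C_0 \left\{\prod_{t=1}^{k-1} \phi_t^{\frac{1}{2}f_t - 1}\right\} \left(1 - \sum_{t=1}^{k-1} \phi_t\right)^{\frac{1}{2}f_k - 1}. \qquad (10)$$

Also L may be written in terms of $\phi$'s,

$$L = \frac{\Sigma(w_t)}{\prod_t (w_t)^{w_t/\Sigma(w_t)}} \left\{\prod_{t=1}^{k-1} \phi_t^{w_t/\Sigma(w_t)}\right\} \left(1 - \sum_{t=1}^{k-1} \phi_t\right)^{w_k/\Sigma(w_t)}. \qquad (11)$$

It follows that the moments of L about zero are given by

$$\mu_q' = C_0\, a_0^{\,q} \int \ldots \int \left\{\prod_{t=1}^{k-1} \phi_t^{\frac{1}{2}f_t + q w_t/\Sigma(w_t) - 1}\right\} \left(1 - \sum_{t=1}^{k-1} \phi_t\right)^{\frac{1}{2}f_k + q w_k/\Sigma(w_t) - 1} d\phi_1 \ldots d\phi_{k-1}, \qquad (12)$$

where $a_0 = \Sigma(w_t) / \prod_t (w_t)^{w_t/\Sigma(w_t)}$ and the region of integration is determined by the conditions
$0 < \phi_t$ (t = 1, 2, ... k − 1) and $\sum_{t=1}^{k-1} \phi_t < 1$. Evaluating the integral we obtain

$$\mu_q' = \frac{\{\Sigma(w_t)\}^{q}}{\prod_t (w_t)^{q w_t/\Sigma(w_t)}}\, \frac{\Gamma\{\tfrac{1}{2}\Sigma(f_t)\}}{\Gamma\{\tfrac{1}{2}\Sigma(f_t) + q\}} \prod_t \frac{\Gamma\{\tfrac{1}{2}f_t + q w_t/\Sigma(w_t)\}}{\Gamma(\tfrac{1}{2}f_t)}. \qquad (13)$$
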

In our present case the moments of $L_1$ are obtained by substituting $w_t = n_t$ and
$f_t = n_t - l - 1$ in the above equation. As was done in the cases of a single variate
and of linear regression we may approximate to the distribution of $L_1$ by means
of a Type I curve with the same mean and variance as those of the true distribution.
If there are the same number of observations in every group (i.e. $n_t = n$), then

$$\mu_q' = k^q\, \frac{\Gamma(\tfrac{1}{2}kf)}{\Gamma(\tfrac{1}{2}kf + q)} \left\{\frac{\Gamma(\tfrac{1}{2}f + q/k)}{\Gamma(\tfrac{1}{2}f)}\right\}^{k}, \qquad (14)$$

where f = n − l − 1. The distribution is now dependent only on k, the number of
groups, and f the number of degrees of freedom within each group after the
regression relation has been fitted. The tables given in P. P. N. Nayer's paper can
thus be used to obtain the 5 per cent. and 1 per cent. levels of significance of $L_1$,
if the entry is made with k and f = n − l − 1 (instead of f = n − 1). If $L_1$ is significantly
small we reject the hypothesis of equal residual variances in the k populations.

If the number of observations, $n_t$, is not the same for each group, Nayer's work
suggests that, provided the $n_t$'s are not too dissimilar nor too small, we may enter
the tables with k and $f = \bar{n} - l - 1$, where $\bar{n}$ is the mean $n_t$.
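
The passage from the general moments to the equal-sample case can be written out step by step. The display below is a worked sketch which assumes that the general result has the Gamma-function form $\mu_q' = a_0^{\,q}\,\Gamma\{\tfrac{1}{2}\Sigma(f_t)\}/\Gamma\{\tfrac{1}{2}\Sigma(f_t)+q\}\,\prod_t \Gamma\{\tfrac{1}{2}f_t + q w_t/\Sigma(w_t)\}/\Gamma(\tfrac{1}{2}f_t)$, and then sets $w_t = n$ and $f_t = f$ for every t.

```latex
\begin{align*}
\Sigma(w_t) &= kn, \qquad \frac{w_t}{\Sigma(w_t)} = \frac{1}{k}, \qquad \Sigma(f_t) = kf, \\[4pt]
a_0^{\,q} &= \frac{(kn)^{q}}{\prod_{t=1}^{k} n^{q/k}} = \frac{k^{q} n^{q}}{n^{q}} = k^{q}, \\[4pt]
\prod_{t=1}^{k} \frac{\Gamma(\tfrac12 f_t + q w_t/\Sigma(w_t))}{\Gamma(\tfrac12 f_t)}
  &= \left\{ \frac{\Gamma(\tfrac12 f + q/k)}{\Gamma(\tfrac12 f)} \right\}^{k}, \\[4pt]
\mu_q' &= k^{q}\, \frac{\Gamma(\tfrac12 kf)}{\Gamma(\tfrac12 kf + q)}
          \left\{ \frac{\Gamma(\tfrac12 f + q/k)}{\Gamma(\tfrac12 f)} \right\}^{k}.
\end{align*}
```

The sample size n cancels completely, which is why the distribution depends only on k and f, and why tables computed for f = n − 1 can be entered with any other value of f.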


Illustrative example. 


As an example we shall consider some data collected in the course of the manufacture
of spectacle glass. The technical details and a fuller discussion of the data
have been given elsewhere.* Briefly it consists of counts of the number of small
bubbles (or seed as they are called) in 150 samples, one from each of 150 flat sheets
of glass. A high number of seed represents poor quality glass, since the discs cut
out for the manufacture of spectacles must be free from seed. The 150 observations
fall into 5 groups of 30 (t = 1, 2, ... 5) separated in manufacture by comparatively
long periods of time: the first group (t = 1) is given in Table I. The reason for the
division into 10 sub-groups of 3 is as follows. In the course of production the glass
is fused in pots in a furnace, one filling of a pot furnishing about 20 sheets of glass;
a sheet is obtained by drawing molten glass out of the pot, blowing it into the
form of a cylinder and then opening and flattening out the cylinder. In the
present investigation samples were taken only from the 3rd, 10th and 16th
cylinders drawn out from any one filling of a pot and thus the data can be classified
according to cylinder number and to filling.


TABLE I. Observations of seed in first group (t = 1).

3rd cylinder     49   34   47   16   …
10th cylinder    62   60   93   29   …
16th cylinder    …

We may then follow out the usual procedure of analysing the variance among 
the thirty observations into three parts due to differences between fillings, differ- 
ences between cylinders, and residual variation when that associated with 
cylinders and fillings has been allowed for. This analysis is given in Table II. 


TABLE II. Analysis of variation of observations in first group (t = 1).

Source of variation    Sum of squares    Degrees of freedom    Mean square
Between cylinders         15 826.2               2                7913.1
Between fillings           4 118.2               9                 457.6
Residual                   6 795.1              18                 377.5
Total                     26 739.5              29


* See a paper entitled "Statistical methods applied to the manufacture of spectacle glasses", by
C. E. Gould and W. M. Hampton, read before the Industrial and Agricultural Research Section,
Royal Statistical Society, May 1936, and to be published in the Supplement to that Society's
Journal.


Similar analyses may be performed for the other four groups of 30. The only
terms that we shall be interested in here are the five estimates of residual variance
and we shall ask whether these estimates differ significantly among themselves.

Let i (= 1, 2, ... 10) denote the filling and j (= 1, 2, 3) the cylinder. Then any
observation in the tth group may be written $y_{tij}$. Behind the analysis of Table II
lies the assumption that we can write $y_{tij}$ in the form

$$y_{tij} = A_t + B_{ti} + C_{tj} + \epsilon_{tij}, \qquad (15)$$

where $A_t$ denotes the expectation of y over the whole group t, $B_{ti}$ a correction to
be added to $A_t$ to give the expectation for the ith filling in that group, $C_{tj}$ a correction
to give the expectation for the jth cylinder, and $\epsilon_{tij}$ a random term normally
distributed about zero with standard deviation $\sigma_t$. We must of course have

$$\sum_i B_{ti} = \sum_j C_{tj} = 0. \qquad (16)$$

$\sigma_t^2$ is the true residual variance and the hypothesis that we are concerned with is
that $\sigma_t$ is the same for all t. This is exactly the problem dealt with above. For
although in equation (2) we regarded the x's as measuring l characters of the
individuals, this is not necessary. Since the x's were regarded as constants
throughout, (2) may be considered as any linear function of the parameters
$\beta_{0t} \ldots \beta_{lt}$ and thus equation (15) falls into the scheme. $A_t$ corresponds to $\beta_{0t}$ and,
if allowance is made for the conditions (16) by replacing in each group $B_{ti}$ and $C_{tj}$
by 11 (= 3 + 10 − 2) independent parameters, then these parameters will correspond
to $\beta_{1t}, \ldots \beta_{lt}$. Thus the criterion for testing the equality of the $\sigma_t$'s is given by
substitution into (7) of the residual sums of squares with f = 30 − 11 − 1 degrees of
freedom obtained after fitting the parameters of equation (15). From Table II
we obtain $\theta_1' = 6795.1$ and from the analyses for the other four groups, $\theta_2' = 4046.9$,
$\theta_3' = 5067.5$, $\theta_4' = 3955.55$ and $\theta_5' = 7361.80$. The numbers of observations in the
groups being the same, (7) reduces to the geometric mean of the $\theta'$'s divided by
their arithmetic mean, i.e. $L_1 = .9675$. In Nayer's table the 5 per cent. point of $L_1$
is not given for k = 5 and f = 18, but for k = 5 and f = 19 it is .903. Hence our $L_1$ is
not significantly small, and a common estimate of the residual variance, with
90 degrees of freedom, can be obtained by combining together the 5 sums of
squares each having 18 degrees of freedom.
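
The figures of this example can be checked directly. The sketch below assumes the criterion has the weighted form $L_1 = N \prod_t (\theta_t'/n_t)^{n_t/N} / \Sigma(\theta_t')$, which for equal group sizes is the geometric mean of the residual sums of squares divided by their arithmetic mean; the variable names are illustrative.

```python
import math


def l1_criterion(thetas, sizes):
    """L1 from residual sums of squares (thetas) and group sizes."""
    total_n = sum(sizes)
    log_l1 = math.log(total_n) - math.log(sum(thetas))
    log_l1 += sum(n / total_n * math.log(theta / n)
                  for theta, n in zip(thetas, sizes))
    return math.exp(log_l1)


RESIDUAL_SS = [6795.1, 4046.9, 5067.5, 3955.55, 7361.80]  # the five groups
GROUP_SIZE = 30
FITTED = 11   # independent filling and cylinder parameters, 3 + 10 - 2

degrees_of_freedom = GROUP_SIZE - FITTED - 1
l1 = l1_criterion(RESIDUAL_SS, [GROUP_SIZE] * len(RESIDUAL_SS))
pooled_variance = sum(RESIDUAL_SS) / (len(RESIDUAL_SS) * degrees_of_freedom)

print(degrees_of_freedom)         # 18
print(round(l1, 4))               # compare with L1 = .9675
print(round(pooled_variance, 2))  # common estimate on 90 degrees of freedom
```

Passing unequal group sizes to the same function gives the weighted criterion, so the equal-size shortcut is only a special case.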


TESTS OF CERTAIN LINEAR HYPOTHESES AND THEIR 
APPLICATION TO SOME EDUCATIONAL PROBLEMS 


By PALMER O. JOHNSON AND J. NEYMAN


CONTENTS

I. Introductory
   (a) Some conceptions and methods of the theory of testing statistical hypotheses
   (b) Educational problems dealt with in the present paper
II. Statistical form of the problem of "matched groups"
   (a) Theory
   (b) Numerical illustrations
III. More general approach to the same problem
   (a) Statement of the problem
   (b) Hypothesis H (x′, y′)
   (c) Details of the experiment influencing the … of …
   (d) "Region of significance"
   (e) Other hypotheses
   (f) Practical application of the preceding theory
IV. Summary of results
V. Numerical data concerning students collected by Clara Brown and Palmer
   O. Johnson
VI. Supplementary Table of the Incomplete Beta Function


I. INTRODUCTORY


(a) Some conceptions and methods of the theory of testing statistical hypotheses. 


Mathematical treatment of any problems of application, such as those con- 
cerning educational questions, must start with a “translation” of the particular 
problem into mathematical language. The educational problems referred to below 
can be expressed mathematically as problems of testing statistical hypotheses. 
The general theory of testing statistical hypotheses has been developed by 
Neyman and Pearson.* A recent paper gives a useful application of the general 
ideas to the test of a class of what are called “linear” statistical hypotheses, 
which seem to be very common in questions of educational psychology.†


* (a) J. Neyman and E. S. Pearson, "On the use and interpretation of certain test criteria...",
Biometrika, vol. XXA (1928). (b) J. Neyman and E. S. Pearson, "On the problem of two samples",
Bull. Acad. Polonaise, Sci. Let. (1930). (c) J. Neyman and E. S. Pearson, "On the problem of k
samples", ibid. (1931). (d) J. Neyman and E. S. Pearson, "On the most efficient tests of
statistical hypotheses", Phil. Trans. Roy. Soc. A, vol. CCXXXI (1933). (e) J. Neyman and
E. S. Pearson, "The testing of statistical hypotheses in relation to probabilities a priori",
Proc. Camb. Phil. Soc. vol. XXIX (1933).

† St Kolodziejczyk, "On an important class of statistical hypotheses", Biometrika, vol. XXVII
(1935).



We shall start by explaining what is meant by a “linear hypothesis” and quote 
St Kolodziejczyk’s results as to the appropriate tests. 

Denote by
$$x_1, x_2, \ldots x_n \tag{1}$$
any observational data and assume that

(1) each of the $x$'s may be considered as an independent observed value of a
normal variate;

(2) it is known that the mean, $m_i$, of any $x$, say $x_i$, is a linear function of $s < n$
unknown parameters $p_j$:
$$m_i = a_{i1} p_1 + a_{i2} p_2 + \ldots + a_{is} p_s \quad (i = 1, 2, \ldots n), \tag{2}$$
the coefficients $a_{ij}$ being known numbers;

(3) out of the $n$ equations (2) it is possible to choose at least one system of $s$
equations which it would be possible to solve with regard to the $p$'s, the
determinant of the system being different from zero;

(4) the variances of all $x$'s about their respective means are equal, although
their common value $\sigma^2$ is generally unknown.

If the above conditions are satisfied then any hypothesis, $H$, which it is possible
to express by means of say $r < s$ equations, with regard to the parameters $p$, say
$$\theta_j = b_{j1} p_1 + b_{j2} p_2 + \ldots + b_{js} p_s = B_j \quad (j = 1, 2, \ldots r), \tag{3}$$
where $B_j$ and the $b$'s are known numbers, is called a linear hypothesis.


St Kolodziejczyk has shown that the test of a linear hypothesis reduces to
the following operations. Denote by
$$\chi^2 = \sum_{i=1}^{n} (x_i - m_i')^2, \tag{4}$$
where $m_i'$ means an expression similar to (2),
$$m_i' = a_{i1} q_1 + a_{i2} q_2 + \ldots + a_{is} q_s \quad (i = 1, 2, \ldots n), \tag{5}$$
except that instead of unknown constant parameters $p$ in (2), the formulae (5)
contain continuous variables $q$.


(1) The first step in testing the hypothesis $H$ consists in finding the minimum
value of $\chi^2$ in (4), where the $x$'s are regarded as constant, and all the $s$ variables $q_i$
are free to assume any values from $-\infty$ to $+\infty$. The process of minimizing $\chi^2$
consists in finding the derivatives with respect to all $q$'s and in equating them to
zero. In this way $s$ independent equations may be obtained which determine the
values, say $q_i^0$, of the $q_i$'s, which minimize $\chi^2$. Substituting $q_i^0$ into (4) we obtain
the minimum value required. This will be called the absolute minimum and
denoted by $S_a^2$.

(2) The next step consists in assuming that the variables $q_1, q_2, \ldots q_s$ satisfy the
equations
$$b_{j1} q_1 + b_{j2} q_2 + \ldots + b_{js} q_s = B_j \quad (j = 1, 2, \ldots r), \tag{6}$$
which express the hypothesis tested, and in finding the minimum value of $\chi^2$ in
(4) under this assumption. This minimum of $\chi^2$ will be called the relative minimum
and denoted by
$$S_r^2 = S_a^2 + S_b^2 \quad \text{(say)}. \tag{7}$$
Obviously the relative minimum will be at least equal to the absolute one.
The easiest method of obtaining the relative minimum will depend upon the
equations (6). Often these are readily solved with regard to some of the $q$'s.
Suppose that the equations (6) may be easily solved with regard to $q_1, q_2, \ldots q_r$,
expressing them in terms of $q_{r+1}, q_{r+2}, \ldots q_s$.
Substituting these expressions in (5) we shall get, say,
$$m_i' = c_{i0} + c_{i,r+1} q_{r+1} + c_{i,r+2} q_{r+2} + \ldots + c_{is} q_s \tag{8}$$
and the problem of the relative minimum with regard to variables $q$, satisfying
(6), will be reduced to that of an absolute minimum with regard to $s - r$ variables
$q_{r+1}, \ldots q_s$.

But sometimes the solution of the equations (6) will be laborious. Then it will
be convenient to use the method of undetermined multipliers of Lagrange. This
consists in calculating the partial derivatives with regard to all $q_i$'s ($i = 1, 2, \ldots s$)
of the function, say,
$$\phi = \sum_{i=1}^{n} (x_i - m_i')^2 - 2 \sum_{j=1}^{r} \alpha_j (b_{j1} q_1 + b_{j2} q_2 + \ldots + b_{js} q_s), \tag{9}$$
where the $\alpha$'s are the undetermined constant multipliers. The values of the $q$'s,
say $q_i(\alpha)$, which minimize $\phi$ will be certain functions of $\alpha_1, \alpha_2, \ldots \alpha_r$. Substituting
$q_i(\alpha)$ into the equations (6) we shall obtain $r$ conditions from which it will be
possible to determine the appropriate values of the $\alpha$'s. Denote these values by
$\alpha_j^0$. Lagrange has shown that the values of $q_i$ which could be obtained by
substituting $\alpha_j^0$ instead of $\alpha_j$ are the values both minimizing $\chi^2$ and satisfying (6).
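
As a small worked illustration of the multiplier method, consider a case that is assumed here and is not taken from the text: two groups of $n_1$ and $n_2$ observations with unknown means $p_1$ and $p_2$ (so that $s = 2$), and the hypothesis $p_1 - p_2 = 0$ (so that $r = 1$ and $B_1 = 0$). The symbols $x_{1k}$, $x_{2k}$, $\bar{x}_1$ and $\bar{x}_2$ are introduced for this sketch only. The function (9) and its solution may then be written out as follows.

```latex
\begin{align*}
\phi &= \sum_{k=1}^{n_1}(x_{1k}-q_1)^2 + \sum_{k=1}^{n_2}(x_{2k}-q_2)^2
        - 2\alpha\,(q_1-q_2),\\
\frac{\partial\phi}{\partial q_1} &= -2n_1(\bar{x}_1-q_1) - 2\alpha = 0
  \quad\Rightarrow\quad q_1(\alpha) = \bar{x}_1 + \frac{\alpha}{n_1},\\
\frac{\partial\phi}{\partial q_2} &= -2n_2(\bar{x}_2-q_2) + 2\alpha = 0
  \quad\Rightarrow\quad q_2(\alpha) = \bar{x}_2 - \frac{\alpha}{n_2},\\
q_1(\alpha)-q_2(\alpha) &= 0
  \quad\Rightarrow\quad \alpha^0 = -\frac{n_1 n_2}{n_1+n_2}\,(\bar{x}_1-\bar{x}_2),\\
q_1(\alpha^0) &= q_2(\alpha^0) = \frac{n_1\bar{x}_1+n_2\bar{x}_2}{n_1+n_2}.
\end{align*}
```

Under the hypothesis both means are thus replaced by the pooled mean, which is also what would be found by solving (6) directly for one of the $q$'s.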
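When the values of the $q$'s giving either an absolute or a relative minimum to
the sum $\chi^2$ are found, the process of obtaining $S_a^2$ or $S_r^2$ may be simplified by the
following method.
Denote by
$$\chi^2(z) = \sum_{i=1}^{n} (z x_i - a_{i1} q_1 - a_{i2} q_2 - \ldots - a_{is} q_s)^2, \tag{10}$$
$z$ being a new variable. Obviously $\chi^2(z) = \chi^2$ if $z = 1$. Now $\chi^2(z)$ is a homogeneous
polynomial of second order in $z$ and the $q$'s.
Therefore we may write the identity of Euler
$$\chi^2(z) = \frac{1}{2} \left( z \frac{\partial \chi^2}{\partial z} + q_1 \frac{\partial \chi^2}{\partial q_1} + \ldots + q_s \frac{\partial \chi^2}{\partial q_s} \right). \tag{11}$$
Substituting $z = 1$ and $q_i = q_i^0$ we shall get both $\partial \chi^2 / \partial q_i = 0$ and $\chi^2(z) = S_a^2$.
Therefore the formula (11) reduces to
$$S_a^2 = \frac{1}{2} \left. \frac{\partial \chi^2}{\partial z} \right|_{z=1,\; q_i = q_i^0} = \sum_{i} x_i^2 - q_1^0 \sum_{i} a_{i1} x_i - q_2^0 \sum_{i} a_{i2} x_i - \ldots - q_s^0 \sum_{i} a_{is} x_i. \tag{12}$$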
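
Formula (12) can be checked numerically against the direct sum of squared residuals. The sketch below, in Python, assumes a straight-line model with $a_{i1} = 1$ and $a_{i2} = t_i$; the data and the names `t`, `x`, `direct` and `by_formula_12` are hypothetical and serve only as an illustration.

```python
# Hypothetical data: observations x_i at known abscissae t_i, so that
# a_i1 = 1 and a_i2 = t_i in equation (2), with s = 2 parameters.
t = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Normal equations for the two parameters, solved in closed form.
st, sx = sum(t), sum(x)
stt = sum(ti * ti for ti in t)
stx = sum(ti * xi for ti, xi in zip(t, x))
q2 = (n * stx - st * sx) / (n * stt - st * st)
q1 = (sx - q2 * st) / n

# Absolute minimum computed directly from the residuals, as in (4).
direct = sum((xi - q1 - q2 * ti) ** 2 for ti, xi in zip(t, x))

# The same quantity from formula (12): sum of squares of the observations
# less each minimizing q times the matching sum of products.
sxx = sum(xi * xi for xi in x)
by_formula_12 = sxx - q1 * sx - q2 * stx

print(direct, by_formula_12)
```

The advantage of (12) in hand computation is that the sums of products are already available from the normal equations, so no residuals need be formed.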


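If it is desired to calculate $S_r^2$, we may proceed in a similar way. Denote by
$\bar{q}_i$ the value of $q_i$ which gives to $\chi^2$ the relative minimum value and satisfies the
equations (6). Then the numbers $\bar{q}_i$ must satisfy the equations
$$\frac{\partial \phi}{\partial q_i} = \frac{\partial \chi^2}{\partial q_i} - 2 \sum_{j=1}^{r} \alpha_j b_{ji} = 0 \quad (i = 1, 2, \ldots s) \tag{13}$$
or
$$\bar{q}_i \frac{\partial \chi^2}{\partial q_i} = 2 \sum_{j=1}^{r} \alpha_j b_{ji} \bar{q}_i, \tag{14}$$
the derivatives being taken at $q_i = \bar{q}_i$.
Summing up this equation for $i = 1, 2, \ldots s$, and taking into account (6), we
obtain
$$\bar{q}_1 \frac{\partial \chi^2}{\partial q_1} + \bar{q}_2 \frac{\partial \chi^2}{\partial q_2} + \ldots + \bar{q}_s \frac{\partial \chi^2}{\partial q_s} = 2 \sum_{j=1}^{r} \alpha_j B_j. \tag{15}$$
Thence from (11)
$$S_r^2 = \frac{1}{2} \left. \frac{\partial \chi^2}{\partial z} \right|_{z=1,\; q_i = \bar{q}_i} + \sum_{j=1}^{r} \alpha_j B_j. \tag{16}$$

This formula is particularly useful when all $B$'s are equal to zero, which will be
often the case. Under this assumption
$$S_r^2 = \sum_{i} x_i^2 - \bar{q}_1 \sum_{i} a_{i1} x_i - \bar{q}_2 \sum_{i} a_{i2} x_i - \ldots - \bar{q}_s \sum_{i} a_{is} x_i. \tag{17}$$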
(3) The third step consists in calculating the ratio, say
$$\zeta = \frac{S_a^2}{S_r^2} = \frac{S_a^2}{S_a^2 + S_b^2}, \tag{18}$$
of the absolute and the relative minimum of $\chi^2$ and in referring to the Tables
of Incomplete Beta Function* in order to determine $P$, or the probability
determined by the hypothesis tested of $\zeta$ having a value equal to or smaller than that
observed. Thus
$$P = \frac{1}{B(p, q)} \int_0^{\zeta} u^{p-1} (1 - u)^{q-1} \, du = I_{\zeta}(p, q), \tag{19}$$
where $p = \frac{1}{2}(n - s)$, $q = \frac{1}{2} r$.

As was shown by Neyman and Pearson, $P$ thus calculated is equal to the upper
limit of the probability of an error when rejecting the hypothesis tested. Therefore
the hypothesis should be rejected only if $P$ is small, say .05, .01, etc.

K. Pearson's tables are calculated for $p$ and $q$ not exceeding 50. Therefore they
may be readily applied whenever $n < 100 + s$.

If $n$ is larger than the above limit, Table V at the end of this paper may be used.
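
Where tables are not at hand, the integral in (19) may be evaluated numerically. The following is a minimal sketch in Python, not the method of the tables themselves; the function name `incomplete_beta_ratio` and the choice of Simpson's rule are assumptions made here for illustration. The sketch is meant only for $0 \le \zeta < 1$ and $p \ge 1$, where the integrand has no singularity on the range of integration.

```python
import math


def incomplete_beta_ratio(x, p, q, steps=2000):
    """I_x(p, q) by composite Simpson's rule.

    Assumes 0 <= x < 1, p >= 1 and an even number of steps, so that the
    integrand u**(p-1) * (1-u)**(q-1) is finite on the whole interval [0, x].
    """
    if x <= 0.0:
        return 0.0
    # log B(p, q) from log-gamma values, to avoid overflow for large p or q.
    log_b = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

    def f(u):
        return u ** (p - 1.0) * (1.0 - u) ** (q - 1.0)

    h = x / steps
    total = f(0.0) + f(x)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * f(k * h)
    return total * h / 3.0 / math.exp(log_b)


# Example call with hypothetical values: n - s = 18 and r = 1.
P = incomplete_beta_ratio(0.80, 0.5 * 18, 0.5 * 1)
print(P)
```

A small value of `P` would then count against the hypothesis tested, in the sense described above.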

* Karl Pearson, Tables of the Incomplete Beta Function (London, 1934). The relation between
$\zeta$ and R. A. Fisher's $z$ is referred to on p. 90.

Suppose now that after having applied the test we decide to reject the
hypothesis tested, that the functions $\theta_j = B_j$ ($j = 1, 2, \ldots r$). The question will then
generally arise of estimating the actual value of any of these functions, say that of $\theta_j$.
It follows from a theorem by A. A. Markoff† that the best linear estimate, say
$F_j$, of $\theta_j$ is given by an expression
$$F_j = b_{j1} q_1^0 + b_{j2} q_2^0 + \ldots + b_{js} q_s^0 \quad (j = 1, 2, \ldots r) \tag{20}$$
which is obtained from the expression of $\theta_j$ merely by substituting the values
$q_i^0$, minimizing $\chi^2$, instead of the corresponding parameters. According to the
same theorem, combined with the results of St Kolodziejczyk, the estimate of
the variance of $F_j$, say $v_j$, is obtained in the following way.

† A. A. Markoff, Wahrscheinlichkeitsrechnung (Leipzig, 1912). See also J. Neyman, J.R.S.S.
vol. XCVII, part 4, pp. 558-625.

We calculate $S_a^2$ and $S_a^2 + S_b^2$, or the absolute and the relative minima of the
sum $\chi^2$, as we should do to test the hypothesis, say $H_j$, which is expressed by only
one equation $\theta_j = B_j$. Then the estimate $v_j$ of the variance of $F_j$ is given by
$$v_j = \frac{S_a^2}{n - s} \cdot \frac{(F_j - B_j)^2}{S_b^2}, \tag{21}$$
$S_b^2$ being the difference between the relative and the absolute minima of $\chi^2$.

It is important to note that the property of the formulae (20) and (21) of giving
the best linear estimate and the estimate of the variance of $F_j$ does not depend
upon any assumption of the normal distribution of the original observations. In
spite of this, it may be proved that $F_j$ is approximately normally distributed,
whenever the number of observations is not very small.

It may be useful to notice that the first ratio in the formula (21), namely
$S_a^2/(n - s)$, presents the estimate of the unknown variance of single observations,
$\sigma^2$, while the other ratio, $(F_j - B_j)^2/S_b^2$, is a constant coefficient, depending
both on the particular form of the function $\theta_j$ and on some properties of the $x$'s.
The accuracy of the estimate $F_j$ may be measured by its variance. It is impossible
to influence the value of $S_a^2/(n - s)$. On the other hand, it is sometimes possible
to adjust the conditions of the experiment so as to minimize the other term in the
expression (21), namely $(F_j - B_j)^2/S_b^2$. This will be shown below.

It will be seen that the most essential part of the statistical treatment of the 
educational problems which will be considered below, consists in presenting them 
in the form of a statistical problem of testing a linear hypothesis. If this has been 
done, the process of finding the appropriate criteria and the reference to tables 
will follow automatically. 

We have to call attention to a certain difficulty which may sometimes arise.
One of the conditions mentioned above consists in the variance of each observation
$x_i$ being constant. To obtain perfectly reliable results, this circumstance must be
tested.* It is at present difficult to suggest the proper way of dealing with the
problems in cases when the variances $\sigma_i^2$ of the $x_i$'s prove to be all different.
However, it may be presumed that if the difference between the $\sigma_i$'s is only
moderate, it will not seriously invalidate the test. It is hoped that this point will
be considered in another paper. At present we shall assume that the variables
with which we shall have to deal all have equal variances. The need for sound
statistical methods in educational work is so great that we feel justified in
publishing the present paper even though its results are subject to serious
limitations.* As a further excuse we may mention that many previous publications on
the subject involve even more stringent conditions for the validity of results,
e.g. assume that the distribution of all characters concerned may be represented
by a normal surface. It will be seen that our assumptions are less restrictive.

* See (1) B. L. Welch, "Some problems in the analysis of regression among k samples of two
variables", Biometrika, vol. XXVII (1935), pp. 145-160. (2) B. L. Welch, "Note on an extension
of the $L_1$ test", this volume, pp. 52-56.

Further on we shall sometimes use expressions like the following: “It has been 
proved that the learning ability of...”, etc. We want to emphasize that such 
expressions should be understood as it were in inverted commas and that there 
are two important restrictions concerning them. 

(a) Speaking of learning ability we have in mind only some measurable cha- 
racter related to the somewhat vague term “learning ability”, namely the 
character which we have agreed to consider as its measure or substitute, but 
which is certainly different from the learning ability itself. 

(b) "It has been proved" means merely that assuming a certain risk of error
as measured by the value of $P$ described above, we have decided to reject a certain
statistical hypothesis. In this we may be wrong for two reasons: (i) we may be 
unlucky and be led to a false conclusion by random sampling variation, though 
the probability of this is small and does not exceed P; (ii) more frequently we 
shall be wrong because our translation of the particular educational problem 
(see p. 57) into the mathematical language is inaccurate. 


* See Karl Pearson, "Thoughts suggested by the papers of Messrs Welch and Kolodziejczyk",
Biometrika, vol. XXVII (1935), pp. 228-259.


(b) Educational problems dealt with in the present paper.

In education, as in other fields, there are many problems which may be
considered primarily as statistical problems. The problem may be one of determining
the sex difference in learning abilities, the relative efficacy of different teaching
procedures or specific techniques of instruction, the more economical of two
methods of memorization, the effect of class periods varying in length, the effect
of varying class sizes, the influence of teacher personnel, or any number of problems
of a similar nature, where the purpose is to determine the effect of an
experimental factor. The method customarily employed for investigations of this
nature is the so-called "controlled experiment" which in theory is an attempt to
apply the "law of the single variable". In an attempt to secure homogeneity
between groups of individuals constituting the subjects of the experiment, it is
customary to pair individuals on characteristics that are known or assumed to
have a relation to learning abilities. Among the bases that have been used are
(1) measures of general intelligence, as measured by intelligence tests, given in
terms of intelligence quotients, point scores, mental ages, standard scores, etc.,
(2) chronological age, (3) initial status with respect to the trait under
consideration, (4) sex, (5) interests, attitudes, or other personality traits, (6) race, etc.
Some evidence exists that most of these factors bear some relation either directly
or indirectly to learning abilities. It is a difficult matter in ordinary school
situations, or even in experimental schools, to match students upon all these
bases at once. Matching on a single measure, such as intelligence, is most
frequently used. Individuals with identical or nearly identical measures are assigned
to alternative categories and constitute what are called "matched groups".*
When distributions have been made in this manner, the extent to which the
groups are equivalent with respect to other characteristics can be determined by
computing statistical constants on those traits for which more or less valid and
reliable measures are available. If too great discrepancies occur, an attempt at
adjustment is made in order to obtain more closely balanced groups. It may be
pointed out here that most pupil characteristics are not independent traits, so
that by matching on one basis we are also partially matching on another. The
difficulties in securing an adequate number of pairs are apparent even in theory,
to say nothing of the difficulties in practice. What is needed is some correction
which must be made for any inequalities between matched groups. Perhaps
better still is a treatment which makes possible an analysis of the conditions as
they exist, rather than a reliance upon this "bodily" shifting of students, which
in many instances interferes with the school machinery, and for this reason
precludes most experimental work.

Let us say that as careful a matching of groups as is practical has been made. 
The experimental factor is introduced; the experiment is under way. When one 
recognizes the complexity of a learning situation and the difficulty of determining 
the independent effect of the many factors in operation, it is clear that we cannot 
legitimately assume that the "law of the single variable" is in operation. But the
conditions of the two or more groups are kept as identical as possible, with the 
exception of the experimental factor the effect of which is to be determined. 

* Of course there is another method of matching, which consists in selecting two larger groups
of individuals with the restriction that the means and perhaps also the variances of the characters
of matching should be in both groups equal. In connection with this point see further, p. 77.

How are the consequences to be measured, i.e. the respective outcomes of the
two distinctive teaching procedures? The success in this endeavour depends
greatly upon the validity and reliability of the measuring instruments, how
completely, and how accurately they measure the purported outcomes of instruction.
The problems involved in the construction of measuring instruments have been
elsewhere discussed.† It should be obvious that a determination of the initial
status of the subjects with respect to the abilities measured, as well as the status
at the conclusion of the experiment, is essential. In much of the work that has
been done, the practice has been to subtract the initial from the final score to
obtain a measure of growth in point scores, or the percentage of gain in terms of
initial status or of a theoretical psychological limit, i.e. the total attainable score.
This method may be legitimate. However, special care must be taken in order to
avoid conclusions based on spurious correlations.

† August C. Krey and Palmer O. Johnson, "Differential functions of examinations,
University Committee on Educational Research", Bulletin of the University of Minnesota, vol. XXXVI,
no. 4, January 25, 1933. Palmer O. Johnson, "The measurement of outcomes of instruction other
than information", School Science and Mathematics, January, 1934, pp. 26-33. Studies in College
Examinations (Minneapolis; University of Minnesota, Committee on Educational Research, 1934),
p. 204.

The experimental method has been described in some detail. While controls
are more frequently employed in the "controlled experiment", they should
represent a salient feature of all comparisons that are carefully made. Lack of
control should be recognized and allowances made for its inadequacy when
interpreting the results. In the method of paired comparisons, for example,
individuals should be matched, or allowances made for inequalities which exist,
with respect to all relevant characteristics in order to establish the significance
of the factor under investigation. Thus, if the problem is to compare the academic
achievement of students who come from broken homes with that of students who
come from stable homes, then, as it is difficult to guess what factors are relevant, it
would be necessary to have all factors identical or to make allowances for
non-equality, with the exception that one group comes from broken homes, the other
from stable homes. If the control consists of standardized norms, it would be
necessary to know the characteristics of the group upon whom the norms were 
established in order to permit comparison with another group under consideration. 
There are many other situations where comparisons are frequently desirable. 
For example, a teacher may be interested in the sex differences in achievement in 
a class in biology; here again the factors bearing a relationship to achievement 
must enter into the analysis and interpretation of the findings. 

It is clear that whatever the matching and whatever the characters of the 
individuals which are taken into account when comparing the effects of an educa- 
tional factor, it will be impossible to eliminate all individual differences influencing 
these effects. Therefore the effect of the educational factor sought may be measured 
only as an average relating to a group of individuals. This is the reason that 
educational problems such as mentioned above are essentially statistical problems. 

The purpose of the present paper consists in indicating some modern statistical 
methods which may be used in treating a broad class of educational problems. 
Theoretical considerations which follow will then be illustrated on actual 
numerical data collected by P. O. Johnson. 

Some of the theory given below has been discussed by Neyman in the course 
of his lectures delivered at University College, London, in 1934. 


II. STATISTICAL FORM OF THE PROBLEM OF "MATCHED GROUPS"
(a) Theory.

The general description of the "matched groups" has been given in the
introductory section (b). We shall now have to consider how the problem of the
comparison of two such groups may be reduced to the form of a test of a linear
hypothesis. We shall assume that the educational problem consists in stating
whether some difference in educational conditions, say $A$ and $B$, influences a
certain numerical character, $X$, of the individuals.

As was mentioned, the matching of groups consists in selecting pairs of in- 
dividuals having the same values of certain characters, which we shall call the 
basic characters of matching and denote by $BC$. Suppose that out of the
individuals available it was possible to select $m$ matched pairs. Then we form out of
these pairs two groups, $G_1$ and $G_2$, of $m$ individuals each, so that each group
contains one representative of each matched pair. The groups $G_1$ and $G_2$ are then
subject to educational conditions $A$ and $B$ respectively, which it is desired to
study. One of them is usually called the experimental group and the other the 
control group. This is of course a conventional distinction. At the end of a certain 
time the experimental character X of all individuals in both groups is measured 
and these measurements are used to judge whether the difference in conditions 
A and B has a significant effect. 

It will be noticed that in this procedure it is tacitly assumed that (1) a difference 
in basic characters of matching may influence the value of the experimental 
character X, and (2) that the difference in the value of X due to the difference in 
conditions A and B, if measured on individuals forming a matched pair, is the 
same whatever the values of the basic characters appropriate to this particular 
pair. This, of course, is apart from random variation. 

We shall treat the problem keeping in view the above assumptions. However, 
we shall try to make our treatment as general as possible, avoiding assumptions 
which are not essential. Such is the assumption that among the individuals 
available it is possible to select only pairs of matched individuals. However 
difficult it may be to find two individuals with the same values of the basic 
characters, it may happen occasionally that we shall be able to find three, four or 
even more such individuals corresponding to one fixed value of BC, while it may 
be difficult to find even pairs of matched individuals with some other values of 
BC. The observational material should be as large as possible, and it is obvious 
that we should try not to omit any single individual which may be used for 
comparison. For this reason the matched groups, which we shall discuss here will 
have a somewhat more general character than they have in other publications. 

The process of matching as described here consists, then, in classifying all the
individuals which may be used for comparison according to the basic character
of matching in a number, $m$, of classes, so that each class contains at least two
individuals. Denote generally by $n_i$ the number of individuals in the $i$th class,
say $C_i$. Each class, $C_i$, will now be divided into two subclasses $C_i'$ and $C_i''$
which will serve for comparison. The total of subclasses $C_i'$ will form one of the
matched groups, $G_1$, the total of subclasses $C_i''$ the other group, $G_2$. The matched
groups $G_1$ and $G_2$ will be subjected to educational conditions $A$ and $B$. Probably
there will be a general tendency to make subclasses $C_i'$ and $C_i''$ equal in size, but
this will be only possible when $n_i$ is an even number. However, certain conditions
of the experiment may sometimes prevent the subclasses $C_i'$ and $C_i''$ being of
the same size, even when $n_i$ is not an odd number. Such is for instance the case
when the subclasses $C_i'$ and $C_i''$ are formed by individuals of different sex. We
shall denote by $n_i'$ and $n_i''$ the numbers of individuals in the two subclasses $C_i'$
and $C_i''$.

Denote further by $x_{ij}$ the value, for the $j$th individual in the subclass $C_i'$, of
the experimental character $X$ on which it is desired to study the effect
of the educational conditions considered. The value of the same character $X$ of
the $j$th individual in the subclass $C_i''$ will be denoted by $y_{ij}$.

We shall now consider in some detail the assumptions concerning the variables
$x_{ij}$ and $y_{ij}$ which, perhaps only subconsciously, must have been in the minds of
those who designed the method of matched groups.

Obviously it was assumed that the $x_{ij}$ and $y_{ij}$ may be correlated with the basic
characters of matching. Otherwise these characters would not be taken into
account. Mathematically this assumption may be expressed by saying that the
population of individuals which would have been included into the class $C_i$, if it
were subject to the conditions $A$, would have the mean of the character $X$
different from that corresponding to class $C_k$, if $i \ne k$. The same distinction applies
if the classes $C_i$ and $C_k$ were subject to the conditions $B$.

Denoting generally by $\mathcal{E}(u)$ the population mean of any variate $u$, we may
write
$$\mathcal{E}(x_{ij}) = p_i', \quad \mathcal{E}(y_{ij}) = p_i. \tag{22}$$

If the difference in conditions $A$ and $B$ has any influence on the average level
of the character $X$, then it will follow that
$$p_i' - p_i = \Delta_i \ne 0. \tag{23}$$


Now we come to the second assumption which probably underlies the idea of
matched groups. As all subclasses forming the matched groups are serving the
same purpose of comparing the effects of conditions $A$ and $B$, it means that all
classes $C_i$ have something in common, in spite of the recognized differences in
characters which served for classification. In other words, it was probably
subconsciously assumed that while the characters used for classification could
influence the means of $x_{ij}$ and $y_{ij}$, they could not influence the difference between
these means. Mathematically this is expressed by writing
$$\mathcal{E}(x_{ij}) - \mathcal{E}(y_{ij}) = p_i' - p_i = \Delta = \text{const.} \tag{24}$$
or
$$p_i' = p_i + \Delta. \tag{25}$$
The hypothesis, $H$, that it is desired to test may now be expressed by the
equation
$$\theta = \Delta = 0. \tag{26}$$


It will be seen that we have succeeded in presenting the problem of the com- 
parison of two matched groups in the form of the test of a linear hypothesis. 


In fact we may now write
$$\mathcal{E}(x_{ij}) = p_i + \Delta \quad (i = 1, 2, \ldots m;\; j = 1, 2, \ldots n_i'), \tag{27}$$
$$\mathcal{E}(y_{ij}) = p_i \quad (i = 1, 2, \ldots m;\; j = 1, 2, \ldots n_i''), \tag{28}$$
where the $p$'s and $\Delta$ play the rôle of the parameters of the theory. The population
mean of each observation is a linear function of these parameters.

There remain only the assumptions of normality and homoscedasticity, i.e.
that both $x_{ij}$ and $y_{ij}$ are varying normally about their respective means with
S.D.'s which have all the same value $\sigma$.

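Assuming this, we readily obtain the solution. The sum of squares, $\chi^2$, is
written as follows:
$$\chi^2 = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i'} (x_{ij} - q_i - q_0)^2 + \sum_{j=1}^{n_i''} (y_{ij} - q_i)^2 \right), \tag{29}$$
where the variable $q_i$ corresponds to the unknown parameter $p_i$ and the variable
$q_0$ to $\Delta$.
We have
$$\begin{aligned}
\frac{\partial \chi^2}{\partial q_0} &= -2 \sum_{i=1}^{m} \sum_{j=1}^{n_i'} (x_{ij} - q_i - q_0) = 0, \\
\frac{\partial \chi^2}{\partial q_i} &= -2 \left( \sum_{j=1}^{n_i'} (x_{ij} - q_i - q_0) + \sum_{j=1}^{n_i''} (y_{ij} - q_i) \right) = 0 \quad (i = 1, 2, \ldots m).
\end{aligned} \tag{30}$$
Denoting
$$\bar{x}_i = \frac{1}{n_i'} \sum_{j=1}^{n_i'} x_{ij}, \quad \bar{y}_i = \frac{1}{n_i''} \sum_{j=1}^{n_i''} y_{ij}, \tag{31}$$
we may simplify the equations (30). Solving them with regard to the $q$'s we shall
obtain their values, minimizing $\chi^2$,
$$q_0^0 = \frac{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i} (\bar{x}_i - \bar{y}_i)}{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}}, \tag{32}$$
$$q_i^0 = \frac{1}{n_i} \left( n_i' \bar{x}_i + n_i'' \bar{y}_i - n_i' q_0^0 \right). \tag{33}$$

According to formula (12) the absolute minimum of the sum $\chi^2$ may be written
$$\begin{aligned}
S_a^2 &= \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i'} x_{ij}^2 - q_0^0 n_i' \bar{x}_i - q_i^0 n_i' \bar{x}_i + \sum_{j=1}^{n_i''} y_{ij}^2 - q_i^0 n_i'' \bar{y}_i \right) \\
&= \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i'} x_{ij}^2 + \sum_{j=1}^{n_i''} y_{ij}^2 - \frac{1}{n_i} (n_i' \bar{x}_i + n_i'' \bar{y}_i)^2 \right) - \frac{\left( \sum_{i=1}^{m} \frac{n_i' n_i''}{n_i} (\bar{x}_i - \bar{y}_i) \right)^2}{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}}.
\end{aligned} \tag{34}$$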

Let us now calculate the relative minimum of $\chi^2$ assuming that the $q$'s satisfy
the equation (26) expressing the hypothesis tested, or in other words that $q_0 = 0$.
Thus the expression to minimize will now be, say,
$$\chi^2 = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i'} (x_{ij} - q_i)^2 + \sum_{j=1}^{n_i''} (y_{ij} - q_i)^2 \right). \tag{35}$$
Proceeding as formerly we find that the values of the $q$'s minimizing (35) are,
say,
$$\bar{q}_i = \frac{n_i' \bar{x}_i + n_i'' \bar{y}_i}{n_i}. \tag{36}$$

The relative minimum of $\chi^2$ will be
$$S_r^2 = S_a^2 + S_b^2 = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i'} x_{ij}^2 + \sum_{j=1}^{n_i''} y_{ij}^2 - \frac{1}{n_i} (n_i' \bar{x}_i + n_i'' \bar{y}_i)^2 \right). \tag{37}$$
Comparing (34) and (37) we obtain
$$S_b^2 = \frac{\left( \sum_{i=1}^{m} \frac{n_i' n_i''}{n_i} (\bar{x}_i - \bar{y}_i) \right)^2}{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}}. \tag{38}$$

Finally $\zeta$ will be found as a ratio of (34) and (37). The Tables of the Incomplete
Beta Function should be entered with $q = \frac{1}{2}$ and $p = \frac{1}{2} \left( \sum_{i=1}^{m} n_i - m - 1 \right)$.
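
The computation of the criterion for matched groups may be sketched as follows, in Python. The scores are invented and the variable names (`s_a2`, `s_b2`, `s_r2`, `zeta`) are hypothetical labels for the quantities in (34), (38), (37) and the ratio of (34) to (37); the sketch stops at the arguments with which the tables would be entered.

```python
# Hypothetical scores: classes[i] = (x-values in C_i', y-values in C_i'').
classes = [
    ([12.0, 15.0], [10.0, 11.0, 13.0]),
    ([20.0], [17.0, 18.0]),
    ([25.0, 27.0, 26.0], [24.0, 22.0]),
]


def mean(values):
    return sum(values) / len(values)


w = [len(x) * len(y) / (len(x) + len(y)) for x, y in classes]
d = [mean(x) - mean(y) for x, y in classes]
wd = sum(wi * di for wi, di in zip(w, d))

# (38): the part of the relative minimum due to the hypothesis.
s_b2 = wd ** 2 / sum(w)

# (37): relative minimum, i.e. variation within classes about pooled means.
s_r2 = sum(
    sum(v * v for v in x + y) - sum(x + y) ** 2 / len(x + y)
    for x, y in classes
)

# (34): absolute minimum, and the ratio of (34) to (37).
s_a2 = s_r2 - s_b2
zeta = s_a2 / s_r2

# Arguments for the Tables of the Incomplete Beta Function.
N = sum(len(x) + len(y) for x, y in classes)
m = len(classes)
p, q = 0.5 * (N - m - 1), 0.5

print(zeta, p, q)
```

A value of `zeta` near 1 means that dropping $\Delta$ from the model hardly increases the minimum, while a value near 0 means that most of the variation within classes is accounted for by the difference between the two groups.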

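Suppose now that having applied the test we reject the hypothesis that
$\theta = \Delta = 0$. The question then arises as to what may be the actual value of $\Delta$.

According to formula (20), the best linear estimate of $\Delta$ is provided by the
expression similar to $\theta$, in which instead of the unknown parameter $\Delta$ the value
of $q_0^0$ should be substituted. Thus
$$F = q_0^0 = \frac{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i} (\bar{x}_i - \bar{y}_i)}{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}} \tag{39}$$
is the best linear estimate of $\Delta$. It is known that if the number of observations is
not very small, it must follow very closely the normal law of frequency. The
variance of $F$, according to formula (21), is estimated by
$$v_F = \frac{S_a^2}{n - s} \cdot \frac{F^2}{S_b^2} = \frac{\sum_{i=1}^{m} \left( \sum_{j=1}^{n_i'} x_{ij}^2 + \sum_{j=1}^{n_i''} y_{ij}^2 - \frac{1}{n_i} (n_i' \bar{x}_i + n_i'' \bar{y}_i)^2 \right) - \frac{\left( \sum_{i=1}^{m} \frac{n_i' n_i''}{n_i} (\bar{x}_i - \bar{y}_i) \right)^2}{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}}}{\left( \sum_{i=1}^{m} n_i - m - 1 \right) \sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}}. \tag{40}$$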
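
The estimate (39) and its estimated variance (40) may be computed together. The following Python sketch again uses invented scores; the names `F`, `v_F` and `std_err` are hypothetical labels, and the last line, which takes a square root to obtain a standard error, is an addition made here for convenience rather than a step stated in the text.

```python
import math

# Hypothetical scores: classes[i] = (x-values in C_i', y-values in C_i'').
classes = [
    ([12.0, 15.0], [10.0, 11.0, 13.0]),
    ([20.0], [17.0, 18.0]),
    ([25.0, 27.0, 26.0], [24.0, 22.0]),
]


def mean(values):
    return sum(values) / len(values)


w = [len(x) * len(y) / (len(x) + len(y)) for x, y in classes]
d = [mean(x) - mean(y) for x, y in classes]
wd = sum(wi * di for wi, di in zip(w, d))

# (39): best linear estimate of the difference between the conditions.
F = wd / sum(w)

# Absolute minimum (34), obtained as the relative minimum (37) less (38).
s_b2 = wd ** 2 / sum(w)
s_r2 = sum(
    sum(v * v for v in x + y) - sum(x + y) ** 2 / len(x + y)
    for x, y in classes
)
s_a2 = s_r2 - s_b2

# (40): n - s equals the total number of individuals less (m + 1).
degrees = sum(len(x) + len(y) for x, y in classes) - len(classes) - 1
v_F = s_a2 / (degrees * sum(w))
std_err = math.sqrt(v_F)

print(F, v_F, std_err)
```

The two forms of (40) agree because, for this problem, $F^2/S_b^2$ reduces to the reciprocal of the sum of the weights $n_i' n_i''/n_i$.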


It will be seen that the part of the above expression, which may be influenced
by the arrangement of the experiments, is
$$\frac{1}{\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i}}. \tag{41}$$

Thus the accuracy of the comparison may be improved by making the sum
$\sum_{i=1}^{m} (n_i' n_i''/n_i)$ as large as possible. Very often the value of this sum will be entirely
determined by the conditions of the problem. Its nature and the observational
material available may fix both the number of classes and the composition of
subclasses, and thus the numbers $n_i'$, $n_i''$ and $n_i$. But in other cases, namely
when we have to deal rather with an experiment than with the process of
observation, both the number of classes and their composition may depend upon the
experimenter, only the total number of individuals used for the experiment
being limited. There are also possible some intermediate situations between those
two extremes. Our choice among possible arrangements of the experiments may
be determined from a consideration of the sum $\sum_{i=1}^{m} (n_i' n_i''/n_i)$.


As a matter of illustration we shall discuss the position when the only limitation concerns the total number of individuals used for the experiment, say $N = \sum_{i=1}^{m} n_i$. We shall do so in two steps. First we shall assume that the number of classes, $m$, and the numbers of individuals to be included in each class, or $n_i$'s, are fixed, and we shall consider the composition of any pair of subclasses which will give the largest possible value of the quotient $(n_i' n_i'')/n_i$. Denote by $k_i$ the value of $n_i'$ which maximizes $(n_i' n_i'')/n_i$. If there are two or more such values of $n_i'$, then let $k_i$ denote the largest among them. We shall have

$$(k_i - 1)(n_i - k_i + 1) \le k_i (n_i - k_i) > (k_i + 1)(n_i - k_i - 1). \tag{42}$$

Solving these inequalities with regard to $k_i$, we get

$$\frac{n_i - 1}{2} < k_i \le \frac{n_i + 1}{2}. \tag{43}$$

If $n_i$ be even then $k_i = \tfrac{1}{2} n_i$ and the maximum value of $(n_i' n_i'')/n_i$ is $\tfrac{1}{4} n_i$. If $n_i$ is odd then $k_i = \tfrac{1}{2}(n_i + 1)$ and the maximum required is $\tfrac{1}{4}(n_i - 1/n_i)$. We may thus conclude that the most satisfactory composition of classes is that in which the sizes of two subclasses are as nearly equal as possible. Now assume that the classes are subdivided into subclasses according to the above principle and consider how the accuracy of the experiment is influenced by any change in the sizes of particular classes. We have

$$\sum_{i=1}^{m} \frac{n_i' n_i''}{n_i} = \tfrac{1}{4}{\sum}' n_i + \tfrac{1}{4}{\sum}'' \left(n_i - \frac{1}{n_i}\right) = \tfrac{1}{4} N - \tfrac{1}{4}{\sum}'' \frac{1}{n_i}, \tag{44}$$
where the sums $\sum'$ and $\sum''$ refer to classes with even and odd numbers of individuals respectively. It will be seen that the size of (44) depends only upon classes in which the number of individuals $n_i$ is odd. If there are no such classes then the term (44) has the value $\tfrac{1}{4}N$, whatever the particular values of $n_i$'s, provided all of them are even. In other cases the value of (44) is smaller. It will be seen that the rule to obtain the greatest possible accuracy consists in using an even number of individuals and in arranging classes so that subclasses should be equal in size. The number of classes does not influence the accuracy of the comparison in any way. This of course is a conclusion which is valid only under the conditions of the problem stated above, that the unknown value of $\Delta$ is the same for every class.
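
The rule just stated may be checked numerically. The following is only a minimal sketch in Python, not part of the original argument: the function names `best_split` and `design_sum` are hypothetical, and exact fractions are used so that the comparison with the closed forms following (43) and with (44) is free of rounding.

```python
from fractions import Fraction


def best_split(n):
    """Return (k, value): the largest n' in 0..n maximizing n' * n'' / n, where n'' = n - n'."""
    best_k, best_val = 0, Fraction(0)
    for k in range(n + 1):
        val = Fraction(k * (n - k), n)
        if val >= best_val:  # ">=" keeps the largest of several maximizing values
            best_k, best_val = k, val
    return best_k, best_val


def design_sum(class_sizes):
    """Sum of n' n'' / n over classes, each class being split as evenly as possible."""
    return sum(best_split(n)[1] for n in class_sizes)


if __name__ == "__main__":
    for n in (6, 7, 8):
        print(n, best_split(n))
    # Classes of even size only: the sum equals N / 4 whatever the number of classes.
    print(design_sum([6, 6, 6, 6]), design_sum([12, 12]))
```

With even class sizes the sum is $\tfrac{1}{4}N$ however the $N$ individuals are distributed among classes; each odd class lowers it by $1/(4n_i)$.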

The above theory refers to the generalized matched groups in which the sizes of classes and of subclasses are arbitrary. In order to obtain formulae referring to ordinary matched groups it will be sufficient to substitute everywhere $n_i' = n_i'' = 1$ and $n_i = 2$.


(b) Numerical illustrations. 

Let us assume, for the purpose of illustration, that we have students entering
university or college from several different types of secondary or preparatory 
schools, e.g. Junior-Senior High Schools, Private Secondary Schools, Regular 
Four-Year High Schools, Junior Colleges, Normal Schools. These students come 
from several occupational levels, e.g. professional workers, business and clerical, 
skilled workers, semi-skilled, and unskilled workers. Both sexes are represented. 
It is desired to compare the academic achievement of students of opposite sex. 
For this comparison we assume it necessary to compare students coming from 
the same occupational group and from the same type of secondary or preparatory 
school. These two criteria constitute the bases for obtaining matched or parallel 
groups. The “gains” made by these students in a course in biology as determined 
by the administration of pre-tests and final-tests are given. These gains are denoted 
by x for males and by y for females. The gains and the classification of 35 individuals are set up in Table I.

We see that in this case the composition of classes and subclasses is completely 
determined by the nature of the problem and of the observational material. We 
merely classify the individuals according to their social level and type of school. 
As the comparison refers to sex we cannot make the subclasses equally numerous. 
In fact we have simply to apply the formulae deduced to the data of Table I as 
they stand. 

Looking at the formulae (34) and (37) we see that we require to calculate (1) the sums of squares of all the x's and y's; (2) the sums of both x's and y's for each class, or $n_i'\bar{x}_i + n_i''\bar{y}_i$; (3) the means $\bar{x}_i$ and $\bar{y}_i$ for each class; (4) the ratios $(n_i' n_i'')/n_i$. These values are set up in the lower part of Table I. Substituting them into the formulae (34) and (37) we get

$$S_a^2 = 1069.9, \quad S_r^2 = 1492.6. \tag{45}$$

Hence

$$\zeta = S_a^2/S_r^2 = .7168. \tag{46}$$




Since $\sum n_i = 35$, $m = 5$, $r = 1$, we have to enter the Tables of the Incomplete Beta Function with $q = \tfrac{1}{2}$, $p = 14.5$. The table does not contain the column corresponding to $p = 14.5$. To get the value of $P$ it would be necessary to interpolate. However, it will be seen that

$$.0014 < P < .0026, \tag{47}$$

which will be probably sufficient to reject the hypothesis tested, the probability of an error in doing so being less than .0026.
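
Where the printed tables lack the required column, the probability may be obtained by direct numerical integration. The sketch below is an illustration only and not the authors' procedure: it assumes that $P$ is the incomplete beta function ratio $I_\zeta(p, q)$, and it evaluates the integral by Simpson's rule with the standard library alone. The function name and the number of steps are arbitrary choices.

```python
import math


def regularized_incomplete_beta(x, p, q, steps=20000):
    """I_x(p, q) by composite Simpson integration; assumes 0 <= x < 1 and p >= 1."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = x / steps

    def integrand(t):
        return t ** (p - 1.0) * (1.0 - t) ** (q - 1.0)

    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    integral = total * h / 3.0
    complete_beta = math.gamma(p) * math.gamma(q) / math.gamma(p + q)
    return integral / complete_beta


if __name__ == "__main__":
    # Values quoted in the text: zeta = .7168, p = 14.5, q = 1/2.
    print(regularized_incomplete_beta(0.7168, 14.5, 0.5))
```

The integrand is smooth on the interval used here because the upper limit stays well below 1, where the factor $(1 - t)^{q-1}$ becomes singular for $q = \tfrac{1}{2}$.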


TABLE I. Gains in Biology of 35 students, males and females, classified 
according to social level and preparatory school. (Fictitious data.) 


Class                     C_1            C_2            C_3            C_4            C_5         Totals
Subclass               C_1'   C_1''   C_2'   C_2''   C_3'   C_3''   C_4'   C_4''   C_5'   C_5''
Sex                      M      F       M      F       M      F       M      F       M      F
Student                  x      y       x      y       x      y       x      y       x      y
No. 1                   25     30      16     29       9     23      28     40      22     28       -
No. 2                   24     39      27     31     [?]     23      27     35      16     40       -
No. 3                   34     43      20      -     [?]     26      32      -      44     38       -
No. 4                  [?]      -      21      -       -    [?]     [?]      -       -      -       -
No. 5                  [?]      -       -      -       -    [?]     [?]      -       -      -       -
Sum of squares        5097   4270    1826   1802     698   2254    4387   2825    2676   3828   29663
Sum of 1st powers      157    112      84     60      44    104     147     75      82    106       -
  (both subclasses)       269            144            148            222            188           -
n_i', n_i''              5      3       4      2       3      5       5      2       3      3      35
Mean                 31.40  37.33   21.00  30.00   14.67  20.80   29.40  37.50   27.33  35.33       -
x̄_i - ȳ_i               -5.93          -9.00          -6.13          -8.10          -8.00           -
(n_i' n_i'')/n_i          1.87           1.33           1.87           1.43           1.50        8.00
Product                   [?]          -11.97           [?]          -11.57         -12.00         [?]


Therefore, if in any actual case we should get results similar to those in the above example, we should state that there is a real difference in abilities of the two sexes. In order to estimate this difference we turn to the formula (39), which gives

$$F = -7.26. \tag{48}$$

The variance of $F$ as estimated by (40) has the value

$$v_F = 4.60, \quad \sqrt{v_F} = 2.15. \tag{49}$$

We may conclude therefore that the true value of $\Delta$ is probably negative in sign, thus suggesting that the achievements of females are on the whole somewhat better than those of males.
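
The figures of this illustration can be recomputed from the summary rows of Table I alone. The following Python sketch is offered only as a check under stated assumptions: formulae (34) and (37) are not reproduced in this section, so the relative minimum is taken here as the total sum of squares less the sum of squared class totals divided by $n_i$, and the absolute minimum as that quantity less the squared weighted sum of differences divided by the sum of weights. The variable and function names are hypothetical.

```python
import math

# Summary rows of Table I, one tuple per class: (n', n'', sum of x, sum of y).
CLASSES = [(5, 3, 157, 112), (4, 2, 84, 60), (3, 5, 44, 104),
           (5, 2, 147, 75), (3, 3, 82, 106)]
TOTAL_SUM_OF_SQUARES = 29663


def matched_group_estimate(classes, total_ss):
    """Return F, S_a^2, S_r^2, zeta and the estimated variance of F."""
    n = sum(a + b for a, b, _, _ in classes)
    m = len(classes)
    weights = [a * b / (a + b) for a, b, _, _ in classes]        # n' n'' / n
    diffs = [sx / a - sy / b for a, b, sx, sy in classes]        # mean x - mean y
    w_sum = sum(weights)
    wd_sum = sum(w * d for w, d in zip(weights, diffs))
    F = wd_sum / w_sum                                           # formula (39)
    s_r2 = total_ss - sum((sx + sy) ** 2 / (a + b) for a, b, sx, sy in classes)
    s_a2 = s_r2 - wd_sum ** 2 / w_sum                            # assumed form
    v_F = s_a2 / ((n - m - 1) * w_sum)                           # formula (40)
    return F, s_a2, s_r2, s_a2 / s_r2, v_F


if __name__ == "__main__":
    F, s_a2, s_r2, zeta, v_F = matched_group_estimate(CLASSES, TOTAL_SUM_OF_SQUARES)
    print(round(F, 2), round(s_a2, 1), round(zeta, 4), round(v_F, 2), round(math.sqrt(v_F), 2))
```

Working from unrounded means, the sketch agrees with the values of $F$, $\zeta$ and $v_F$ quoted in the text to the number of figures given there.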




III. MORE GENERAL APPROACH TO THE SAME PROBLEM


(a) Statement of the problem. 


In the preceding section we have considered the case where the principle of 
classification of individuals was purely qualitative. There are certainly instances 
when this is actually the case. However, very generally the characters which 
serve for matching are measurable. These are such characters as (1) general 
intelligence, as measured by intelligence tests, given in terms of intelligence 
quotients, point scores, mental ages, (2) chronological age, (3) initial status with 
respect to the character under consideration, (4) percentile rank in graduating 
class, i.e. for college entrants, (5) scores on various aptitude tests, etc. 

The treatment of cases where the characters used for matching or classification 
are measurable may be exactly the same as in the case where they are not. However, it is possible to suggest another method which is likely to be both more convenient in handling the material available and based on some less rigid assumptions. For both reasons it is likely that improved accuracy will result.

Weak points in the matched groups method are as follows: (a) There are difficulties in selecting individuals which could be considered as "matched" or as belonging to the same class. If there are several characters which are used for matching, then out of a considerable class of students we may often select only several couples which could be considered as "matched". (b) Another difficulty is connected with the fact that the difference, previously denoted by $\Delta$, in the mean effect of two sets of educational conditions A and B, which it is desired to study, probably depends in most cases upon some character which has been used for matching.

Both these difficulties may be easily overcome if we assume that the regression 
of the experimental character, which serves for the comparison of the conditions 
A and B, on the basic characters of matching may be represented by any particular 
formula. It will be seen that if this assumption is admissible, then the necessity 
of matching individuals disappears and also the estimate of the difference in 
effects of the conditions A and B ceases to be a rigid one and may refer separately 
to different classes of individuals. 

It will be further noticed that, if the matching of individuals is unnecessary,
we shall be able to use for the comparisons all individuals available, whether their 
characters differ very much or not. This will result in increased accuracy. 

The limitation concerning the form of the regression equation is that it should 
be linear with regard to the constants involved. 

Before entering into details we shall have to fix an appropriate notation. 
Formerly we required only a notation for the experimental character of in- 
dividuals which it was assumed could be influenced by the conditions A and B 
compared, and for the values of this character in particular individuals. Thus we used letters X for the character itself and x and y for its values in conditions A and B respectively.

Now we shall have to denote both the experimental character and its values 
and also the values of several other characters of the individuals, which it is 
assumed may influence both the experimental character and its reaction on the 
conditions A and B, and which we described as basic characters of matching. 
Matching characters will now play the rôle of independent variables and the
experimental character that of a dependent variable. There are no limitations on
the number of independent variables. However, for the sake of simplicity, we shall
assume that there are only two, which will be denoted by x and y. The values of 
the experimental character for individuals subject to the conditions A and B 
will be denoted by z and u, respectively. 

Our basic assumptions may be expressed by saying that the population means 
of z and u of individuals with fixed values of x and y are some functions (not 
necessarily the same) of these values x and y, i.e. 

$$\mathcal{E}(z) = f_1(x, y), \quad \mathcal{E}(u) = f_2(x, y). \tag{50}$$

Our purpose is 

(i) To test the hypothesis, say $H(x', y')$, that for some particular system of fixed values $x = x'$, $y = y'$ the two means are equal, or

$$f_1(x', y') = f_2(x', y'). \tag{51}$$

This, if true, would mean that the difference in conditions A and B does not influence the average value of the experimental character of individuals in which $x = x'$ and $y = y'$.

(ii) If the above hypothesis $H(x', y')$ proves unlikely, then we shall have to solve another problem, consisting in estimating the difference, say

$$\theta(x', y') = f_1(x', y') - f_2(x', y'). \tag{52}$$
The functions $f_1$ and $f_2$ may be chosen arbitrarily, according to the conditions of the particular problem, with the only theoretical restriction that both functions $f_1$ and $f_2$ must be linear with regard to the unknown constants they involve. Therefore we could assume that, e.g.

$$f_1(x, y) = A_0 + A_1 \cos xy + A_2 e^{\sin x}, \tag{53}$$

but our method would fail if it were necessary to assume

$$f_1(x, y) = \cos (A_1 x + A_2), \tag{54}$$

since here the dependence upon $A_1$ and $A_2$ is not linear.
(b) Hypothesis H (x’,y’). 


In many practical cases it will probably be assumed that both $f_1$ and $f_2$ are polynomials with regard to all characters of matching. Often it will be assumed that these polynomials are of first order. This is the case which we shall discuss in detail. Thus we shall write

$$\mathcal{E}(z) = A_0 + A_1 x + A_2 y, \tag{55}$$

$$\mathcal{E}(u) = B_0 + B_1 x + B_2 y. \tag{56}$$

The coefficients A's and B's will play the rôle of the parameters p mentioned in Section I. The hypothesis, $H(x', y')$, to test will be expressed by one equation

$$\theta(x', y') = A_0 - B_0 + (A_1 - B_1) x' + (A_2 - B_2) y' = 0. \tag{57}$$

We shall denote by $n_1$ and $n_2$ the numbers of individuals subject to conditions A and B respectively. Further on we shall have to denote summations extending over all individuals subjected either to A or B. They will be denoted by $\Sigma_1$ and $\Sigma_2$ respectively.

It should be remembered that it is taken for granted that, for any fixed values of x and y, the distributions of z and u are normal about the means (55) and (56) with the same, though unknown, variance, $\sigma^2$.

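We now proceed to the solution of the problem. The sum of squares $\chi^2$ to be minimized may be written

$$\chi^2 = \Sigma_1 (z - a_0 - a_1 x - a_2 y)^2 + \Sigma_2 (u - b_0 - b_1 x - b_2 y)^2, \tag{58}$$

where the a's and b's are continuous variables, corresponding to the q's of the theory explained in Section I. Their values minimizing (58) will be denoted by $a^0$'s and $b^0$'s. We have

$$\begin{aligned}
\frac{\partial \chi^2}{\partial a_0} &= -2\,\Sigma_1 (z - a_0 - a_1 x - a_2 y), \\
\frac{\partial \chi^2}{\partial a_1} &= -2\,\Sigma_1 (xz - a_0 x - a_1 x^2 - a_2 xy), \\
\frac{\partial \chi^2}{\partial a_2} &= -2\,\Sigma_1 (yz - a_0 y - a_1 xy - a_2 y^2).
\end{aligned} \tag{59}$$

Equating these expressions to zero we get familiar equations to determine the partial regression coefficients of z on x and y:

$$\begin{aligned}
a_0^0 + a_1^0 \bar{x}_1 + a_2^0 \bar{y}_1 &= \bar{z}, \\
a_0^0 \bar{x}_1 + a_1^0 \frac{1}{n_1}\Sigma_1 x^2 + a_2^0 \frac{1}{n_1}\Sigma_1 xy &= \frac{1}{n_1}\Sigma_1 xz, \\
a_0^0 \bar{y}_1 + a_1^0 \frac{1}{n_1}\Sigma_1 xy + a_2^0 \frac{1}{n_1}\Sigma_1 y^2 &= \frac{1}{n_1}\Sigma_1 yz.
\end{aligned} \tag{60}$$

After some easy algebra the two last equations reduce to

$$\begin{aligned}
a_1^0 \sigma'_x + a_2^0 r'_{xy} \sigma'_y &= r'_{xz} \sigma_z, \\
a_1^0 r'_{xy} \sigma'_x + a_2^0 \sigma'_y &= r'_{yz} \sigma_z,
\end{aligned} \tag{61}$$

where $\bar{x}_1$, $\bar{y}_1$ and $\bar{z}$, the $r'$'s, the $\sigma'$'s and $\sigma_z$ are the means, the coefficients of correlation and the S.D.'s calculated for the group subject to the conditions A. $\sigma$'s and $r$'s with two dashes will refer to the other group of individuals.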
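
As an aid to computation, the normal equations (60) may be solved directly. The sketch below is illustrative only: the function names are hypothetical, the data used in the usage lines are invented, and the small Gauss-Jordan routine stands in for whatever method of solution the computer prefers. The function `check_61` returns the differences between the two sides of the equations (61), which should vanish at the solution when standard deviations are taken with divisor $n$.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [u - factor * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def mean(v):
    return sum(v) / len(v)


def mean_product(u, v):
    return sum(p * q for p, q in zip(u, v)) / len(u)


def fit_plane(xs, ys, zs):
    """Return a0, a1, a2 satisfying the normal equations (60) for z on x and y."""
    A = [[1.0, mean(xs), mean(ys)],
         [mean(xs), mean_product(xs, xs), mean_product(xs, ys)],
         [mean(ys), mean_product(xs, ys), mean_product(ys, ys)]]
    b = [mean(zs), mean_product(xs, zs), mean_product(ys, zs)]
    return solve(A, b)


def sd(v):
    return math.sqrt(mean_product(v, v) - mean(v) ** 2)


def corr(u, v):
    return (mean_product(u, v) - mean(u) * mean(v)) / (sd(u) * sd(v))


def check_61(xs, ys, zs, a1, a2):
    """Left-hand side minus right-hand side of each of the two equations (61)."""
    sx, sy, sz = sd(xs), sd(ys), sd(zs)
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (a1 * sx + a2 * rxy * sy - rxz * sz,
            a1 * rxy * sx + a2 * sy - ryz * sz)


if __name__ == "__main__":
    xs, ys, zs = [1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [3, 5, 4, 8, 7, 11]
    a0, a1, a2 = fit_plane(xs, ys, zs)
    print(a0, a1, a2, check_61(xs, ys, zs, a1, a2))
```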




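Similar equations will be obtained by equating to zero the derivatives of $\chi^2$ with regard to $b_0$, $b_1$ and $b_2$. The absolute minimum of $\chi^2$ is readily obtained from the theory of multiple correlation, thus

$$S_a^2 = n_1 \sigma_z^2\, \frac{1 - {r'_{xy}}^2 - {r'_{xz}}^2 - {r'_{yz}}^2 + 2 r'_{xy} r'_{xz} r'_{yz}}{1 - {r'_{xy}}^2} + n_2 \sigma_u^2\, \frac{1 - {r''_{xy}}^2 - {r''_{xu}}^2 - {r''_{yu}}^2 + 2 r''_{xy} r''_{xu} r''_{yu}}{1 - {r''_{xy}}^2}. \tag{62}$$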


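We may now turn to the calculation of the relative minimum $S_r^2$ of $\chi^2$ subject to the condition that

$$(a_0 + a_1 x' + a_2 y') - (b_0 + b_1 x' + b_2 y') = \theta = 0. \tag{63}$$

For this purpose we differentiate the function

$$\phi = \chi^2 - 2\alpha \{(a_0 + a_1 x' + a_2 y') - (b_0 + b_1 x' + b_2 y')\} \tag{64}$$

with regard to all variables $a_i$ and $b_i$ $(i = 0, 1, 2)$ and equate the derivatives to zero. After some easy transformations we get the following equations, determining the values, say $\bar{a}$ and $\bar{b}$, of the variables ascribing to $\chi^2$ its relative minimum value

$$\bar{a}_0 + \bar{a}_1 \bar{x}_1 + \bar{a}_2 \bar{y}_1 = \bar{z} + \frac{\alpha}{n_1}, \tag{65}$$

$$\begin{aligned}
\bar{a}_0 \bar{x}_1 + \bar{a}_1 \frac{1}{n_1}\Sigma_1 x^2 + \bar{a}_2 \frac{1}{n_1}\Sigma_1 xy &= \frac{1}{n_1}\Sigma_1 xz + \frac{\alpha}{n_1} x', \\
\bar{a}_0 \bar{y}_1 + \bar{a}_1 \frac{1}{n_1}\Sigma_1 xy + \bar{a}_2 \frac{1}{n_1}\Sigma_1 y^2 &= \frac{1}{n_1}\Sigma_1 yz + \frac{\alpha}{n_1} y'.
\end{aligned} \tag{66}$$

Multiplying the first equation alternatively by $\bar{x}_1$ and $\bar{y}_1$ and subtracting from the following ones and then dividing the results by $\sigma'_x$ and $\sigma'_y$ respectively, we get

$$\begin{aligned}
\bar{a}_1 \sigma'_x + \bar{a}_2 r'_{xy} \sigma'_y &= r'_{xz} \sigma_z + \frac{\alpha}{n_1} \xi_1, \\
\bar{a}_1 r'_{xy} \sigma'_x + \bar{a}_2 \sigma'_y &= r'_{yz} \sigma_z + \frac{\alpha}{n_1} \eta_1.
\end{aligned} \tag{67}$$

Similarly we shall get

$$\bar{b}_0 + \bar{b}_1 \bar{x}_2 + \bar{b}_2 \bar{y}_2 = \bar{u} - \frac{\alpha}{n_2} \tag{68}$$

and

$$\begin{aligned}
\bar{b}_1 \sigma''_x + \bar{b}_2 r''_{xy} \sigma''_y &= r''_{xu} \sigma_u - \frac{\alpha}{n_2} \xi_2, \\
\bar{b}_1 r''_{xy} \sigma''_x + \bar{b}_2 \sigma''_y &= r''_{yu} \sigma_u - \frac{\alpha}{n_2} \eta_2,
\end{aligned} \tag{69}$$

where

$$\xi_1 = \frac{x' - \bar{x}_1}{\sigma'_x}, \quad \eta_1 = \frac{y' - \bar{y}_1}{\sigma'_y}, \tag{70}$$

and the significance of $\xi_2$ and $\eta_2$ is a similar one.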




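The equations (65) and (66) being analogous to (60), it will be seen that the solutions of (66) may be put into the following form:

$$\bar{a}_1 = a_1^0 + \frac{\alpha}{n_1 \sigma'_x} \cdot \frac{\xi_1 - \eta_1 r'_{xy}}{1 - {r'_{xy}}^2}, \tag{71}$$

$$\bar{a}_2 = a_2^0 + \frac{\alpha}{n_1 \sigma'_y} \cdot \frac{\eta_1 - \xi_1 r'_{xy}}{1 - {r'_{xy}}^2}. \tag{72}$$

Using these expressions and also (65) and the first of the equations (60), we get

$$\bar{a}_0 + \bar{a}_1 x' + \bar{a}_2 y' = a_0^0 + a_1^0 x' + a_2^0 y' + \alpha P, \tag{73}$$

where

$$P = \frac{1}{n_1}\left(1 + \frac{\xi_1^2 - 2 r'_{xy} \xi_1 \eta_1 + \eta_1^2}{1 - {r'_{xy}}^2}\right). \tag{74}$$

Similarly it is easy to obtain

$$\bar{b}_0 + \bar{b}_1 x' + \bar{b}_2 y' = b_0^0 + b_1^0 x' + b_2^0 y' - \alpha Q, \tag{75}$$

where

$$Q = \frac{1}{n_2}\left(1 + \frac{\xi_2^2 - 2 r''_{xy} \xi_2 \eta_2 + \eta_2^2}{1 - {r''_{xy}}^2}\right). \tag{76}$$

It follows from (58), (74) and (76) that

$$\alpha = -\frac{(a_0^0 + a_1^0 x' + a_2^0 y') - (b_0^0 + b_1^0 x' + b_2^0 y')}{P + Q} = -\frac{F(x', y')}{P + Q} \quad \text{(say)}. \tag{77}$$

It will be noticed that the expression in the numerator of (77), denoted by $F(x', y')$, is the best linear estimate of $\theta(x', y')$ in (57).

Using the formula (17) of the introductory section we may now write

$$S_r^2 = (\Sigma_1 z^2 - \bar{a}_0 \Sigma_1 z - \bar{a}_1 \Sigma_1 xz - \bar{a}_2 \Sigma_1 yz) + (\Sigma_2 u^2 - \bar{b}_0 \Sigma_2 u - \bar{b}_1 \Sigma_2 xu - \bar{b}_2 \Sigma_2 yu). \tag{78}$$

Using the formulae (71), (72), (65), and also the equations (60), we obtain from (78)

$$S_r^2 = \Sigma_1 z^2 - a_0^0 \Sigma_1 z - a_1^0 \Sigma_1 xz - a_2^0 \Sigma_1 yz + \Sigma_2 u^2 - b_0^0 \Sigma_2 u - b_1^0 \Sigma_2 xu - b_2^0 \Sigma_2 yu - \alpha F. \tag{79}$$

Taking into the account (77) and (12) of Section I, we obtain finally

$$S_r^2 = S_a^2 + \frac{F^2}{P + Q}, \tag{80}$$

where, using (77) and the formula for multiple regression,

$$\begin{aligned}
F = {} & \left(\bar{z} + \frac{\sigma_z}{\sigma'_x} \cdot \frac{r'_{xz} - r'_{xy} r'_{yz}}{1 - {r'_{xy}}^2}\,(x' - \bar{x}_1) + \frac{\sigma_z}{\sigma'_y} \cdot \frac{r'_{yz} - r'_{xy} r'_{xz}}{1 - {r'_{xy}}^2}\,(y' - \bar{y}_1)\right) \\
& - \left(\bar{u} + \frac{\sigma_u}{\sigma''_x} \cdot \frac{r''_{xu} - r''_{xy} r''_{yu}}{1 - {r''_{xy}}^2}\,(x' - \bar{x}_2) + \frac{\sigma_u}{\sigma''_y} \cdot \frac{r''_{yu} - r''_{xy} r''_{xu}}{1 - {r''_{xy}}^2}\,(y' - \bar{y}_2)\right)
\end{aligned} \tag{81}$$

and

$$\begin{aligned}
P + Q = {} & \frac{1}{n_1}\left\{1 + \frac{1}{1 - {r'_{xy}}^2}\left(\frac{(x' - \bar{x}_1)^2}{{\sigma'_x}^2} - 2 r'_{xy} \frac{(x' - \bar{x}_1)(y' - \bar{y}_1)}{\sigma'_x \sigma'_y} + \frac{(y' - \bar{y}_1)^2}{{\sigma'_y}^2}\right)\right\} \\
& + \frac{1}{n_2}\left\{1 + \frac{1}{1 - {r''_{xy}}^2}\left(\frac{(x' - \bar{x}_2)^2}{{\sigma''_x}^2} - 2 r''_{xy} \frac{(x' - \bar{x}_2)(y' - \bar{y}_2)}{\sigma''_x \sigma''_y} + \frac{(y' - \bar{y}_2)^2}{{\sigma''_y}^2}\right)\right\}.
\end{aligned} \tag{82}$$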




Dividing $S_a^2$ by $S_r^2$ we shall obtain $\zeta$, after which it will be necessary to refer to the Tables of the Incomplete Beta Function. This should be entered with $p = \tfrac{1}{2}(n_1 + n_2 - 6)$ and $q = \tfrac{1}{2}$.
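
The whole calculation of the criterion at a chosen point $(x', y')$ can be arranged as below. This is a sketch under assumptions and not the computing scheme of the memoir: the regressions are fitted directly from the sums of products rather than from the correlation formulae, and $P$ and $Q$ are obtained as the quadratic form $v^{\mathsf T}(X^{\mathsf T}X)^{-1}v$ with $v = (1, x', y')$, which is algebraically the same quantity as (74) and (76). All function names are hypothetical.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [u - factor * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_matrix(xs, ys):
    """Matrix of sums of products for the design rows (1, x, y)."""
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    return [[len(xs), sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]


def fit(xs, ys, ts):
    """Least-squares coefficients of t on (1, x, y) and the residual sum of squares."""
    rhs = [sum(ts), sum(x * t for x, t in zip(xs, ts)), sum(y * t for y, t in zip(ys, ts))]
    c = solve(normal_matrix(xs, ys), rhs)
    rss = sum((t - c[0] - c[1] * x - c[2] * y) ** 2 for x, y, t in zip(xs, ys, ts))
    return c, rss


def leverage(xs, ys, x0, y0):
    """v'(X'X)^(-1) v with v = (1, x0, y0); equal to P of (74) or Q of (76)."""
    v = [1.0, x0, y0]
    w = solve(normal_matrix(xs, ys), v)
    return sum(p * q for p, q in zip(v, w))


def criterion_at(group_a, group_b, x0, y0):
    """F, P + Q, S_a^2, S_r^2, zeta and the table arguments p, q for H(x0, y0)."""
    (xa, ya, z), (xb, yb, u) = group_a, group_b
    a, rss_a = fit(xa, ya, z)
    b, rss_b = fit(xb, yb, u)
    F = (a[0] + a[1] * x0 + a[2] * y0) - (b[0] + b[1] * x0 + b[2] * y0)
    pq = leverage(xa, ya, x0, y0) + leverage(xb, yb, x0, y0)
    s_a2 = rss_a + rss_b
    s_r2 = s_a2 + F * F / pq                      # formula (80)
    return {"F": F, "P+Q": pq, "S_a2": s_a2, "S_r2": s_r2, "zeta": s_a2 / s_r2,
            "p": 0.5 * (len(z) + len(u) - 6), "q": 0.5}


if __name__ == "__main__":
    # Invented figures, for illustration only.
    group_a = ([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [3, 5, 4, 8, 7, 11])
    group_b = ([1, 3, 2, 5, 4, 6, 7], [1, 2, 4, 3, 6, 5, 7], [2, 4, 5, 6, 9, 8, 12])
    print(criterion_at(group_a, group_b, 3.0, 3.0))
```

The value of $\zeta$ so found would then be referred to the Tables of the Incomplete Beta Function with the arguments $p$ and $q$ returned by the function.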


(c) Details of the experiment influencing the accuracy of results. 


The formulae (80), (81) and (82) lead to the following remarks.
Consider the criterion by which the hypothesis $H(x', y')$ is tested,

$$\zeta = S_a^2/S_r^2 = \left(1 + \frac{F^2}{(P + Q)\, S_a^2}\right)^{-1}. \tag{83}$$

The hypothesis $H(x', y')$ will be rejected whenever the criterion $\zeta$ is "too small". Now the value of this criterion depends upon three numbers $F(x', y')$, $S_a^2$ and $P + Q$. Here $F(x', y')$ is the estimate of the unknown $\theta(x', y')$, and $S_a^2$ that of the unknown variance $\sigma^2$ multiplied by $n_1 + n_2 - 6$. Therefore the values of $F$ and $S_a^2$ are outside the control of the experimenter. On the other hand, $P + Q$ depends only upon x's and y's (not on z and u) or upon what has been called the basic characters of matching. If in a particular problem we are dealing with an experiment in which the values of the basic characters may be chosen in advance, we may use this circumstance so as to increase the stringency of the test. It will be seen from (83) that for any given values of $F(x', y')$ and $S_a^2$ the value of $\zeta$ is diminished when $P + Q$ is diminished. Therefore the stringency of the test, i.e. the chance of detecting the fact if the hypothesis is not true, is increased by making $P + Q$ as small as possible. Looking at the formula (82), we may deduce the following rules of improving the stringency of the test.


(i) If it is desired to test the hypothesis $H(x', y')$ only for one particular system of values $x'$ and $y'$, then the greatest stringency is obtained when the means of the x's and y's in both groups of individuals subject to the conditions A and B are exactly equal to the fixed numbers $x'$ and $y'$. Then $P + Q$ attains its minimum value $\dfrac{1}{n_1} + \dfrac{1}{n_2}$.

(ii) If there are difficulties in attaining the absolute equalities

$$\bar{x}_1 = \bar{x}_2 = x', \quad \bar{y}_1 = \bar{y}_2 = y', \tag{84}$$

and indeed if it is desired to test the hypothesis $H(x', y')$ for more than one system of values of $x'$ and $y'$, then it becomes highly important to make $r'_{xy}$ and $r''_{xy}$ as small as possible and the S.D.'s of x and y in both groups as large as possible.

(iii) Whatever may be the case, it is desirable to have $n_1 = n_2$.
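
The effect of these rules on $P + Q$ may be seen from a few trial values. The sketch below merely evaluates the expression (74), which has the same form as (76); the function names and the trial figures are hypothetical and serve for illustration only.

```python
def p_term(n, xi, eta, r):
    """(1/n) * (1 + (xi^2 - 2 r xi eta + eta^2) / (1 - r^2)): the form of P in (74) and Q in (76)."""
    return (1.0 + (xi * xi - 2.0 * r * xi * eta + eta * eta) / (1.0 - r * r)) / n


def standardized(value, group_mean, group_sd):
    """Deviation of the chosen point from the group mean in units of the group S.D., as in (70)."""
    return (value - group_mean) / group_sd


def best_first_group_size(total):
    """Size n1 (with n2 = total - n1) giving the least value of 1/n1 + 1/n2."""
    return min(range(1, total), key=lambda n1: 1.0 / n1 + 1.0 / (total - n1))


if __name__ == "__main__":
    # Rule (i): at the group means xi = eta = 0 and the term reduces to 1/n.
    print(p_term(10, 0.0, 0.0, 0.3))
    # Rule (ii): doubling the S.D.'s halves xi and eta and reduces the term.
    print(p_term(10, standardized(7.0, 5.0, 2.0), standardized(7.0, 5.0, 2.0), 0.0),
          p_term(10, standardized(7.0, 5.0, 4.0), standardized(7.0, 5.0, 4.0), 0.0))
    # Rule (iii): for a fixed total the two groups should be of equal size.
    print(best_first_group_size(20))
```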


It will be noticed that the above three rules describe the amount of "matching" which is important from the point of view of accuracy. It follows that any other attempts to make the two groups more equalized would not result in any gain in precision. On the other hand, as such attempts will almost invariably tend to diminish the number of individuals compared, they should be considered not only as useless but also as harmful. This, of course, applies to the statistical problem considered. In some other cases the desirable amount of matching may be different.


(d) "Region of significance."

This section is concerned with the case where it is desired to discriminate between such systems of values $x'$ and $y'$ for which the hypothesis $H(x', y')$ should be rejected and the remaining ones. It will be convenient to represent each system of $x'$ and $y'$ (or each hypothesis $H(x', y')$) by a point on a plane with coordinates $x'$ and $y'$. Generally it will be possible to determine a region $R$, which we shall call the region of significance, such that the hypothesis $H(x', y')$ represented by any point of this region should be rejected.

The region of significance is found as follows. Denote by $\varepsilon$ a number such that if the value of $P$ found from the Tables of the Incomplete Beta Function is less than or equal to $\varepsilon$, we should be inclined to reject the hypothesis tested, $H(x', y')$, whereas if $P > \varepsilon$, then we should abstain from rejecting $H(x', y')$. This number $\varepsilon$, which may be arbitrarily chosen by any research worker, e.g. $\varepsilon = .05$, $\varepsilon = .01$, etc., will be called the level of significance. Denote by $\zeta_\varepsilon$ the value of $\zeta$ corresponding to the chosen level of significance. Now the region $R$ will be defined by the inequality

$$S_a^2 \le \zeta_\varepsilon S_r^2, \tag{85}$$
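which may be reduced to the following:

$$\begin{aligned}
& \left(a_0^0 - b_0^0 + (a_1^0 - b_1^0)\, x' + (a_2^0 - b_2^0)\, y'\right)^2 \frac{\zeta_\varepsilon}{1 - \zeta_\varepsilon} \\
& \quad - \Bigg[\frac{1}{n_1 (1 - {r'_{xy}}^2)}\left(\frac{(x' - \bar{x}_1)^2}{{\sigma'_x}^2} - 2 r'_{xy} \frac{(x' - \bar{x}_1)(y' - \bar{y}_1)}{\sigma'_x \sigma'_y} + \frac{(y' - \bar{y}_1)^2}{{\sigma'_y}^2}\right) \\
& \qquad + \frac{1}{n_2 (1 - {r''_{xy}}^2)}\left(\frac{(x' - \bar{x}_2)^2}{{\sigma''_x}^2} - 2 r''_{xy} \frac{(x' - \bar{x}_2)(y' - \bar{y}_2)}{\sigma''_x \sigma''_y} + \frac{(y' - \bar{y}_2)^2}{{\sigma''_y}^2}\right) + \frac{1}{n_1} + \frac{1}{n_2}\Bigg] S_a^2 \ge 0.
\end{aligned} \tag{86}$$

Having this inequality, it will be easy to determine the region $R$ in any particular case. It will be remembered that here the only variables are $x'$ and $y'$. The curve limiting the region of significance, say $L$, will be a quadratic.

It may be interesting to draw some general conclusions concerning $R$. For this purpose let us write the formula (86) in the following short form:

$$F^2(x', y')\, w - (P + Q)\, S_a^2 \ge 0, \tag{87}$$

with $w = \zeta_\varepsilon/(1 - \zeta_\varepsilon)$. It will be noticed that the expression $(P + Q)\, S_a^2$ is always positive with a minimum, say $H$, which may vanish only if $\bar{x}_1 = \bar{x}_2$ and $\bar{y}_1 = \bar{y}_2$. It will be seen also that the curve corresponding to the equation

$$(P + Q)\, S_a^2 = \text{any positive constant} \tag{88}$$

is an ellipse, say $E$, with its centre at the point in which $P + Q$ is minimum. This point will be called the centre of accuracy and denoted by $C_0$. This is the point in which the variance of $F$ attains its minimum value.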



Consider now the locus of points $(x', y')$ in which the estimate, $F(x', y')$, of $\theta(x', y')$ is equal to zero. This corresponds to the equation

$$F(x', y') = a_0^0 - b_0^0 + (a_1^0 - b_1^0)\, x' + (a_2^0 - b_2^0)\, y' = 0 \tag{89}$$

and is a straight line, which we shall call the line of non-significance and denote by $F$.

It will be seen that the quadratic, $L$, limiting $R$ does not cut $F$. In fact in each point of $F$ the left-hand side of (87) is negative. Thus the whole line $F$ lies outside the region $R$.

Now let us transform the coordinates, taking as new axis of ordinates, say $O\eta$, the line of non-significance $F$ and as the axis of abscissae, $O\xi$, the diameter of the ellipse $E$, conjugate with the direction of $F$. It is immaterial which direction of the new axis of ordinates will be considered as positive. As to the direction of the new axis of abscissae, we shall assume that it is determined by the sign of $F(x', y')$ in (89). The inequality (87) will now have the form

$$w k^2 \xi^2 - A (\xi - \xi_0)^2 - C \eta^2 - H \ge 0, \tag{90}$$

$k^2$, $A$, $C$ and $\xi_0$ being independent of $\xi$ and $\eta$, and $\xi_0$ meaning the new abscissa of the centre of accuracy $C_0$.


If $w k^2 \ne A$, then the inequality (90) may be written in the form

$$(w k^2 - A)(\xi - \xi_1)^2 - C \eta^2 \ge H + \frac{A\, w k^2\, \xi_0^2}{w k^2 - A}, \tag{91}$$

and the region $R$ is lying either inside of an ellipse or is limited by a hyperbola. The line $L$ is an ellipse whenever $w k^2 < A$. This ellipse will be imaginary if the right-hand side of (91) is positive. In such a case there will be no hypothesis $H(x', y')$ which should be rejected. In other cases it will be a real ellipse with the centre at the point, say $M$, with coordinates

$$\xi_1 = \frac{A\, \xi_0}{A - w k^2}, \quad \eta_1 = 0. \tag{92}$$

It will be seen that $\xi_1$ has the same sign as $\xi_0$ and that $|\xi_1| > |\xi_0|$. It follows that $M$ lies on the new axis of abscissae, so that the centre of accuracy $C_0$ is falling between $M$ and the intersection of the axis of abscissae with the line of non-significance.

In the case when $w k^2 > A$, the curve, $L$, limiting $R$ is a hyperbola.

Finally, if $w k^2 = A$, the quadratic $L$ is a parabola.

It will be seen that the centre of accuracy $C_0$ may or may not lie within the region of significance. It is interesting to note that the region of significance may be composed of one or of two parts. In the latter case we reject the hypotheses $H(x', y')$, stating that $\theta(x', y') = 0$, in favour of the alternative hypotheses, sometimes assuming that $\theta(x', y')$ is positive and sometimes that it is negative, according with the sign of $F$. If the region of significance is composed only of one part, then whatever $x'$ and $y'$, we shall either accept the hypothesis tested or reject it in favour of only one kind of alternative.

Details as to the practical computations are given in the part (f) of the present 
section. 


(e) Other hypotheses. 

The test of the hypothesis $H(x', y')$ may be usefully supplemented by some further analysis of the data. It may for instance be asked whether the coefficients of partial regression of z and u on x are the same or not. The same may be asked with respect to the regression on y. Further, we may ask whether it is possible to admit that the two coefficients of partial regression on x and on y in both regression equations

$$\mathcal{E}(z) = A_0 + A_1 x + A_2 y \tag{93}$$

and

$$\mathcal{E}(u) = B_0 + B_1 x + B_2 y \tag{94}$$

are equal, i.e. both $A_1 = B_1$ and $A_2 = B_2$. Finally we may sometimes know for certain that $A_1 = B_1$ and $A_2 = B_2$ and may wish to test the hypothesis that $A_0 = B_0$.

It would be perhaps too long to consider all these problems in detail. But it 
seems to be useful to give the final result in the form of the proper criterion to 
calculate and some indications of how to enter the tables. 

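Hypothesis $H\{A_1 = B_1\}$. In order to test the hypothesis that one of the partial coefficients of regression, e.g. on x, has the same value in both equations (93) and (94), calculate $S_a^2$ as given in (62) and $S_r^2$ according to the formula

$$S_r^2 = S_a^2 + \frac{n_1 n_2\, {\sigma'_x}^2 {\sigma''_x}^2 (1 - {r'_{xy}}^2)(1 - {r''_{xy}}^2)}{n_1 {\sigma'_x}^2 (1 - {r'_{xy}}^2) + n_2 {\sigma''_x}^2 (1 - {r''_{xy}}^2)}\, (a_1^0 - b_1^0)^2. \tag{95}$$

Calculate $\zeta = S_a^2/S_r^2$ and enter the table with $p = \tfrac{1}{2}(n_1 + n_2 - 6)$, $q = \tfrac{1}{2}$.

Hypothesis $H\{(A_1 = B_1)(A_2 = B_2)\}$. In order to test the hypothesis that both $A_1 = B_1$ and $A_2 = B_2$, calculate $S_a^2$ as given in (62) and $S_r^2$ according to the formula

$$S_r^2 = \Sigma_z^2 - \frac{\Sigma_y^2\, p_{xz}^2 + \Sigma_x^2\, p_{yz}^2 - 2 p_{xy}\, p_{xz}\, p_{yz}}{\Sigma_x^2\, \Sigma_y^2 - p_{xy}^2}, \tag{96}$$

where

$$\begin{aligned}
\Sigma_z^2 &= n_1 \sigma_z^2 + n_2 \sigma_u^2, & p_{xy} &= (\Sigma_1 xy - n_1 \bar{x}_1 \bar{y}_1) + (\Sigma_2 xy - n_2 \bar{x}_2 \bar{y}_2), \\
\Sigma_x^2 &= n_1 {\sigma'_x}^2 + n_2 {\sigma''_x}^2, & p_{xz} &= (\Sigma_1 xz - n_1 \bar{x}_1 \bar{z}) + (\Sigma_2 xu - n_2 \bar{x}_2 \bar{u}),
\end{aligned} \tag{97}$$

and other letters have similar meanings. Calculate $\zeta = S_a^2/S_r^2$ and enter the table with $p = \tfrac{1}{2}(n_1 + n_2 - 6)$, $q = 1$.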

Hypothesis $H\{(A_0 = B_0) \mid (A_1 = B_1)(A_2 = B_2)\}$. Assume now that it is known for certain that $A_1 = B_1$ and $A_2 = B_2$ and that it is desired to test the hypothesis that $A_0 = B_0$. The absolute minimum $S_a^2$ of $\chi^2$ will now be identical with $S_r^2$ as given by the formula (96). $S_r^2$ will have the form

$$S_r^2 = (n_1 + n_2)\, \sigma_0^2\, \frac{1 - R_{xy}^2 - R_{xz}^2 - R_{yz}^2 + 2 R_{xy} R_{xz} R_{yz}}{1 - R_{xy}^2}, \tag{98}$$

where $\sigma_0$ means the standard deviation of the $n_1 + n_2$ observed values of both z and u combined together and the $R$'s the coefficients of correlation, each being calculated from $n_1 + n_2$ pairs of values of the variates for the two groups of individuals combined, the difference between variates z and u being disregarded. E.g.

$$R_{xz} = \frac{(n_1 + n_2)(\Sigma_1 xz + \Sigma_2 xu) - (\Sigma_1 x + \Sigma_2 x)(\Sigma_1 z + \Sigma_2 u)}{\sqrt{\{(n_1 + n_2)(\Sigma_1 x^2 + \Sigma_2 x^2) - (\Sigma_1 x + \Sigma_2 x)^2\}\,\{(n_1 + n_2)(\Sigma_1 z^2 + \Sigma_2 u^2) - (\Sigma_1 z + \Sigma_2 u)^2\}}}. \tag{99}$$

The table should be entered with $p = \tfrac{1}{2}(n_1 + n_2 - 4)$, $q = \tfrac{1}{2}$.

Hypothesis $H\{(A_0 = B_0)(A_1 = B_1)(A_2 = B_2)\}$. Finally we may test the hypothesis that both regression equations (93) and (94) are identical without assuming that we have any a priori information concerning any of the coefficients. In this case $S_a^2$ will be calculated from the formula (62) and $S_r^2$ from (98);

$$p = \tfrac{1}{2}(n_1 + n_2 - 6), \quad q = 1.5.$$
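
For the last hypothesis the two minima may be computed as follows. This is a sketch only, with hypothetical function names and invented figures: $S_a^2$ is obtained here by fitting each group separately by least squares instead of using (62), and $S_r^2$ is computed from the correlation form (98) applied to the two groups combined. The function `rss_plane` applied to the combined data gives the same $S_r^2$ and may serve as a check on the arithmetic.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [u - factor * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def rss_plane(xs, ys, ts):
    """Residual sum of squares of the least-squares plane t = c0 + c1 x + c2 y."""
    A = [[len(ts), sum(xs), sum(ys)],
         [sum(xs), dot(xs, xs), dot(xs, ys)],
         [sum(ys), dot(xs, ys), dot(ys, ys)]]
    c = solve(A, [sum(ts), dot(xs, ts), dot(ys, ts)])
    return sum((t - c[0] - c[1] * x - c[2] * y) ** 2 for x, y, t in zip(xs, ys, ts))


def corr(us, vs):
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cuv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    cuu = sum((u - mu) ** 2 for u in us)
    cvv = sum((v - mv) ** 2 for v in vs)
    return cuv / math.sqrt(cuu * cvv)


def rss_formula_98(xs, ys, ts):
    """n * sd(t)^2 * (1 - Rxy^2 - Rxt^2 - Ryt^2 + 2 Rxy Rxt Ryt) / (1 - Rxy^2), as in (98)."""
    n = len(ts)
    mt = sum(ts) / n
    var_t = sum((t - mt) ** 2 for t in ts) / n
    rxy, rxt, ryt = corr(xs, ys), corr(xs, ts), corr(ys, ts)
    return n * var_t * (1 - rxy ** 2 - rxt ** 2 - ryt ** 2 + 2 * rxy * rxt * ryt) / (1 - rxy ** 2)


def identity_criterion(group_a, group_b):
    """zeta, p and q for the hypothesis that the two regression equations are identical."""
    (xa, ya, z), (xb, yb, u) = group_a, group_b
    s_a2 = rss_plane(xa, ya, z) + rss_plane(xb, yb, u)
    s_r2 = rss_formula_98(xa + xb, ya + yb, z + u)
    return s_a2 / s_r2, 0.5 * (len(z) + len(u) - 6), 1.5


if __name__ == "__main__":
    group_a = ([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [3, 5, 4, 8, 7, 11])
    group_b = ([1, 3, 2, 5, 4, 6, 7], [1, 2, 4, 3, 6, 5, 7], [2, 4, 5, 6, 9, 8, 12])
    print(identity_criterion(group_a, group_b))
```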
(f) Practical application of the preceding theory. 

The example presented here for illustration of the more general approach to 
the problem of "matched" groups is directed at determining sex difference in
achievement in biology. There are two criteria for determining the relative 
abilities of the two groups, each of which has been shown to bear a relationship 
to achievement, (1) rank in high-school graduating class, converted to standard 
deviation form with a mean of 50 and a standard deviation unit of 10, (2) initial 
score on the achievement tests. The measures of achievement consisted of tests 
for three objectives: (1) the acquisition of technical vocabulary, (2) the acquisition 
of important biological principles and other factual information, (3) the ability 
to apply a knowledge of biological principles in the solution of biological situations 
or problems.* The examinations were administered at the beginning and at the 
end of a course in biology given in the General College of the University of 
Minnesota.† The examination testing the acquisition of vocabulary consisted of
100 items arranged in units, one of which is presented below for illustration.

Specific directions. You are to place on the blank line to the left of each definition or description the key number of the correct item selected from the list of terms arranged in alphabetical order.


Key no. Key no. Key no. Key no. 
1 amboceptor 5 bacteremia 9 hypersensitivity 13 serum 
2 antibody 6 bacteriophage 10 lysin 14 symbiosis 
3 antigen 7 catabiosis 11 phagocytosis 15 tuberculin 
4 antitoxin 8 exotoxin 12 plasma 16 vaccine 


Definition or description 


____ 1. An ultramicroscopic parasite of bacteria.
____ 2. The ingestion of microorganisms by cells.
____ 3. A general name for substances in the blood serum that tend to counteract the effect of germs.
____ 4. The liquid part of clotted blood.
____ 5. A dead or attenuated antigen.
____ 6. A general name for any substance that causes the formation of antibodies.
____ 7. The liquid part of unclotted blood.
____ 8. A substance contained in certain serums which when introduced into the blood stream neutralizes bacterial poisons.
____ 9. An antibody which dissolves cells.
____ 10. A condition of exaggerated response to certain substances not exhibited by normal people.

* For the construction and properties of these examinations see references in footnote, p. 63.
† These data were collected in an investigation under way by Clara Brown and Palmer O. Johnson dealing with a determination of the relative values of two different courses in biology in preparing students for certain sequent courses for which biology is a pre-requisite.


The examination testing the acquisition of biological principles or other 
factual information consisted of 50 items. Illustrations of the type follow. 


Specific directions. In the following exercises you are to select from the choices within the paren- 
theses the one that correctly completes the statement. Cross out the letter corresponding to this 
choice in the column at the left. 


abcde Sex in man is controlled by the (a, autosomes; b, genes; c, chromomeres; d, accessory
chromosomes; e, centrosomes).


abcde The development of every characteristic of an organism depends upon (a, the inheritance
of acquired characters; b, continuity of the germ plasm; c, interaction of heredity and
environment; d, organic evolution; e, proper fertilization).


The examination testing the ability to apply a knowledge of biological prin- 
ciples or other factual information to the interpretation of problems and situations 
consisted of 50 items. Two illustrations follow. 


Specific directions. Each of the exercises below begins with an introductory statement or the 
description of a situation. Then follows a list of statements or inferences. You are to select the one 
which seems to you the most logical and most complete interpretation of the recorded facts. Cross 
out the appropriate letter in the column at the left. 


abcde Without sexual reproduction, evolution has never proceeded beyond the production of 
relatively simple organisms because: 


a, Chromosomes are not found in any but the cells of higher organisms. 

b, Without sex, mutations are impossible. 

c, Sexual reproduction enhances hereditary variability by bringing together characters 
which have originated in a number of separate lines instead of merely one single line. 

d, Sexual reproduction makes for the production of a far greater number of individuals, 
thereby increasing the chance of possible different characters. 

e, Sexual stimulation is essential to evolutionary processes. 


abcde A brain tumour was removed from a young boy. When he grew to be a man, he was killed 
in an automobile accident. At autopsy, the physician saw that: 


a, New nerve cells had replaced the removed cells. 
b, The brain tissue regenerated in large amounts. 
c, No new brain cells were to be found. 

d, Complete cortical degeneration was found. 

e, An increased number of gyri were present. 


As has already been mentioned all students have been submitted to the same 
examination twice: before and after taking the course in biology. In each case the 
papers were scored, the score representing the total number of questions answered 
correctly, so that the highest possible score was 200. The data obtained concern 
145 boys and 81 girls. They are given in Tables III and IV at the end of the 


PALMER O, JOHNSON AND J. NEYMAN 83 


paper, in which x means the percentile rank in the high school graduating class,
y the initial score on the examination and z (for boys) and u (for girls) the final
score on the same examination.

The problem of comparing the abilities of the two sexes requires the taking into 
account of a number of factors influencing the examination scores. We think it 
a minimum requirement to consider both the initial and final scores and the 
percentile rank in the graduating class. 

The final score must be influenced by the two other variables. Generally the 
higher the initial score the higher must be the final one. On the other hand, if the 
initial score is very high, the gain could not be very great. Again, if the initial 
score is rather low while the percentile rank in the graduating class is high, we may 
assume that the individual is able to acquire knowledge, but did not succeed in 
doing so owing perhaps to a relatively poor teaching at the high school, lack of 
interest, industry or some other trait, or he may not have taken biology in high 
school. If he lives up to his potentiality as a student a relatively high final score 
may be expected. 

For the above reasons a valid comparison of the abilities of two sexes is only 
possible when we compare the final scores of individuals having the same status 
in the graduating class and the same initial scores. 

The problem could be perhaps treated by the orthodox matched group method. 
However, we did not think it advisable for the following reasons:


(1) When attempting to select groups which could be considered as matched
according to the basic characters x and y, we shall be able to select only a very
small number of individuals. That is, unless we make very broad groups, treating
as equal values of x and y which in reality differ considerably.


(2) The difference in the ability of acquiring the biological knowledge between 
two sexes, if any, may depend on the values of the basic characters x and y and 
therefore should be measured, as it were, separately for each system of these 
values. 


We do not think it entirely correct to assume that the regressions of both z and u
on x and y are represented by planes. On the contrary we see good reasons to
assume that these regressions are skew. However, we assume linearity as a first 
approximation, hoping that in the future we shall be able to consider the problem 
more fully following the same method of approach. 

Using the data mentioned some frequency constants were calculated as given 
in Table II. It will be seen that on the whole the girls both ranked at the high 
school and scored at the university slightly better than boys. Is this because of 
a better ability or for some other reasons ? 

Naturally this question should be asked with respect to the boys and the girls
who had equal ranks in the high school and who scored equally at the initial
examination, i.e. with respect to boys and girls for whom x and y have some equal
values, say x' and y'. The nature of the problem does not suggest any particular
system of these values x', y'. Therefore we shall try to determine all the systems
of values x', y' for which it would be reasonable to assume that the difference
between the two sexes in the ability of acquiring biological knowledge is a real one.
The problem is thus reduced to the consideration of the hypothesis previously
denoted by H(x', y') and to the determination of the region of significance, R.


TABLE II.

            Boys                                  Girls

    $\bar{x}_1 = 43.6761$                 $\bar{x}_2 = 49.9190$
    $\bar{y}_1 = 20.8483$                 $\bar{y}_2 = 22.6667$
    $\bar{z} = 47.7724$                   $\bar{u} = 52.2346$
    $\bar{z} - \bar{y}_1 = 26.9241$       $\bar{u} - \bar{y}_2 = 29.5679$
    $\sigma_x' = 8.6759$                  $\sigma_x'' = 10.1310$
    $\sigma_y' = 9.1058$                  $\sigma_y'' = 11.1787$
    $\sigma_z = 11.2611$                  $\sigma_u = 10.6008$
    $r_{xy}' = .2739$                     $r_{xy}'' = .2955$
    $r_{xz} = .3153$                      $r_{xu} = .4476$
    $r_{yz} = .5728$                      $r_{yu} = .5424$
    $n_1 = 145$                           $n_2 = 81$


Let us fix the level of significance $\alpha = 0.01$. The total number of observations
being $n = 145 + 81 = 226$ and the number of parameters $s = 6$, we see that the
difference $n - s = 220$. Therefore Pearson's Beta Function Tables could not be
used. However, we may use the supplementary table at the end of the present
paper, giving the values of $\zeta_\alpha$ corresponding to several systems of $r$ and $n - s$.
In our case the hypothesis $H(x', y')$ is expressed by only one equation, therefore
$r = 1$. The table does not contain the value of $n - s = 220$, so that to get the
appropriate value of $\zeta_{.01}$ it is necessary to interpolate. For this purpose, following the
method invented by R. A. Fisher, we shall use as the independent variable not
$n - s$ but, say,

$$t = \frac{600}{n - s}. \qquad (100)$$

It will be seen that the table gives values of $\zeta_\alpha$ corresponding to $t = 0, 1, 2, 3, 4, 5$
and 6. For $n - s = 220$ we have $t = 2\tfrac{8}{11}$. Linear interpolation between the values
corresponding to $t = 2$ and $t = 3$ gives $\zeta_{.01} = .9702$, hence $w = \zeta_{.01}/(1 - \zeta_{.01}) = 32.56$.
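The interpolation just described can be checked numerically. The following is a minimal sketch in Python and not part of the original computation; the function name `interpolate_zeta` is ours, and the two tabulated values are those of Table V for r = 1 at the .01 level.

```python
def interpolate_zeta(n_minus_s, zeta_at_t_lo, zeta_at_t_hi, t_lo, t_hi):
    """Linear interpolation of zeta in t = 600/(n - s) between two tabulated rows."""
    t = 600.0 / n_minus_s
    frac = (t - t_lo) / (t_hi - t_lo)
    return zeta_at_t_lo + frac * (zeta_at_t_hi - zeta_at_t_lo)

# Table V, r = 1, alpha = .01: zeta = .9781 at t = 2 (n - s = 300)
# and zeta = .9673 at t = 3 (n - s = 200).
zeta_01 = round(interpolate_zeta(220, 0.9781, 0.9673, 2, 3), 4)

# w is formed from the four-decimal value of zeta, as in the text.
w = zeta_01 / (1.0 - zeta_01)
print(zeta_01, round(w, 2))
```

Rounding the interpolated value to four decimals before forming w is an assumption about how the printed figure was obtained; with it, the sketch returns the two values quoted in the text.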
We shall now follow the method explained in pp. 70-80 and calculate separately 
all formulae which are independent of w. We have 


$$a_0^0 + a_1^0 x' + a_2^0 y' = 24.50194 + (.22232)\,x' + (.65030)\,y', \qquad (101)$$
$$b_0^0 + b_1^0 x' + b_2^0 y' = 26.13373 + (.32936)\,x' + (.42615)\,y', \qquad (102)$$
giving the estimates of the average final score for boys and girls respectively.

Subtracting, we obtain the estimate $F$ of the difference $\theta(x', y')$:
$$F(x', y') = -1.63179 - (.10698)\,x' + (.22415)\,y' \qquad (103)$$
$$= a + bx' + cy' \quad \text{(say)}.$$
Putting $F(x', y') = 0$, we obtain the equation of the line of non-significance.
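As an illustration of the subtraction leading to (103), a short Python sketch follows. It is an illustration only; the names `F` and `y_on_line_of_non_significance` are ours. Subtracting the printed coefficients of x' gives -.10704 against the -.10698 printed in (103), a difference in the fourth decimal which presumably reflects rounding in the original work.

```python
# Estimated regression planes (101) and (102):
# constant term, coefficient of x', coefficient of y'.
boys = (24.50194, 0.22232, 0.65030)
girls = (26.13373, 0.32936, 0.42615)

# Coefficients a, b, c of F(x', y') = a + b x' + c y'.
a, b, c = (p - q for p, q in zip(boys, girls))

def F(x, y):
    """Estimated difference (boys minus girls) in average final score at (x, y)."""
    return a + b * x + c * y

def y_on_line_of_non_significance(x):
    """Value of y' at which F(x', y') = 0 for a given x'."""
    return -(a + b * x) / c
```

A negative value of `F` at a point means that the estimated final score of the girls is the higher one there.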


Next, using formula (62), we get
$$S_a^2 = 17457.61. \qquad (104)$$

Using formula (82) and expanding, we obtain
$$S_a^2 (P + Q) = (4.03001)\,x'^2 - 2(1.06742)\,x'y' + (3.45955)\,y'^2 - 2(167.00405)\,x' - 2(25.09422)\,y' + 8631.57618$$
$$= A'x'^2 + 2B'x'y' + C'y'^2 + 2D'x' + 2E'y' + H' \quad \text{(say)}. \qquad (105)$$
Equating the above expression to any constant we shall obtain the equation of
an ellipse $E$, the centre of which is the centre of accuracy defined previously.
The coordinates of the centre of accuracy, say $x_0$, $y_0$, are found from the
equations
$$A'x_0 + B'y_0 + D' = 0, \qquad B'x_0 + C'y_0 + E' = 0. \qquad (106)$$

In our example
$$x_0 = 47.2204, \qquad y_0 = 21.8232. \qquad (107)$$
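The two linear equations for the centre of accuracy may be solved directly. The sketch below uses Cramer's rule with the coefficients printed in (105); the variable names (`Ap` for A', and so on) are ours and purely illustrative.

```python
# Coefficients of the quadratic form (105); the suffix p stands for the prime.
Ap, Bp, Cp = 4.03001, -1.06742, 3.45955
Dp, Ep = -167.00405, -25.09422

# Solve A'x0 + B'y0 = -D' and B'x0 + C'y0 = -E' by Cramer's rule.
det = Ap * Cp - Bp * Bp
x0 = (-Dp * Cp + Ep * Bp) / det
y0 = (-Ep * Ap + Dp * Bp) / det
print(round(x0, 4), round(y0, 4))
```

The solution agrees with the printed coordinates of the centre to within about one unit in the fourth decimal.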


The equation of the diameter of the ellipse conjugate with the line of
non-significance, say
$$a_1 + b_1 x' + c_1 y' = -40.11854 + (.78913)\,x' + (.13084)\,y', \qquad (108)$$
is obtained from the formulae
$$a_1 = D'c - E'b, \qquad b_1 = A'c - B'b, \qquad c_1 = B'c - C'b. \qquad (109)$$
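The coefficients of the conjugate diameter follow from three products. A brief Python sketch with the printed constants is given here as a check; the variable names are ours. Since a diameter passes through the centre of the ellipse, the left-hand side of its equation should vanish, up to rounding, at the centre of accuracy, and the sketch evaluates it there.

```python
# Coefficients of the quadratic form (105); the suffix p stands for the prime.
Ap, Bp, Cp = 4.03001, -1.06742, 3.45955
Dp, Ep = -167.00405, -25.09422
b, c = -0.10698, 0.22415          # coefficients of x' and y' in F, from (103)
x0, y0 = 47.2204, 21.8232         # centre of accuracy, from (107)

# Formulae (109).
a1 = Dp * c - Ep * b
b1 = Ap * c - Bp * b
c1 = Bp * c - Cp * b

# Left-hand side of the diameter equation at the centre; it should be close to zero.
residual_at_centre = a1 + b1 * x0 + c1 * y0
```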


If we take the line of non-significance $F(x', y') = 0$ for the new axis of ordinates,
$O_1\eta$, and the diameter (108) for the new axis of abscissae, $O_1\xi$, then according to
formula (90), the boundary, $L$, of the region of significance, $R$, will correspond to
the equation
$$w k^2 \xi^2 - A(\xi - \xi_0)^2 - C\eta^2 - H = 0. \qquad (110)$$
Known formulae of analytical geometry give
$$k^2 = \frac{(b_1 c - b c_1)^2}{b_1^2 + c_1^2} = .056945, \qquad (111)$$
$$A = \frac{A'c_1^2 - 2B'b_1c_1 + C'b_1^2}{b_1^2 + c_1^2}, \qquad (112)$$
$$\xi_0 = \frac{1}{k}(a + bx_0 + cy_0) = -7.50834, \qquad (113)$$
$$C = \frac{b_1 c - b c_1}{b^2 + c^2} = 3.14527, \qquad (114)$$
$$H = D'x_0 + E'y_0 + H'. \qquad (115)$$
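Two of the constants above, $k^2$ and $\xi_0$, depend only on quantities already printed, and may be recomputed as a check. The following Python sketch does so, taking $k$ as the positive square root (an assumption on our part); the variable names are ours.

```python
import math

a, b, c = -1.63179, -0.10698, 0.22415   # F = a + b x' + c y', from (103)
b1, c1 = 0.78913, 0.13084               # from (108)
x0, y0 = 47.2204, 21.8232               # from (107)

# k^2 as the squared change of F per unit step along the new axis of abscissae.
k2 = (b1 * c - b * c1) ** 2 / (b1 ** 2 + c1 ** 2)
k = math.sqrt(k2)

# Abscissa of the centre of accuracy in the new system.
xi0 = (a + b * x0 + c * y0) / k
print(round(k2, 6), round(xi0, 3))
```

Both results agree with the printed values to within the rounding of the constants entering them.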


It has been formerly stated (p. 79) that the shape of the region of significance
depends upon the relationship of $w$ and certain constants based on observational
data. Thus if
$$w < \frac{AH}{k^2 (H + \xi_0^2 A)}, \qquad (116)$$
then the region of significance does not exist at all. If
$$\frac{AH}{k^2 (H + \xi_0^2 A)} \le w < \frac{A}{k^2}, \qquad (117)$$
then the region of significance is inside of an ellipse. If $wk^2 = A$, then it is limited
by a parabola, and finally, if $wk^2 > A$, then its boundary is formed by a hyperbola.
In our case
$$\frac{AH}{k^2 (H + \xi_0^2 A)} = 43.4541, \qquad (118)$$
and the value of $w$ appears to be less than this limit. The region of significance
does not exist, and, as far as we desire not to risk errors in our judgments with
probabilities exceeding .01, we have to conclude that the data available are not
sufficient for us to state that there is a real difference between the average final
scores of girls and boys, whatever be their position at the high school and whatever
their initial scores.*

This applies to our actual data. For purposes of illustration, however, we shall
consider what would be the position if our data were somewhat more numerous, so that
$n - s = 300$, say. We shall assume that all frequency constants have the same
values as given in Table II, except for $n_1$ and $n_2$, which will now be proportionately
increased so that $n_1 = 196$ and $n_2 = 110$, their sum being equal to 306.

Formulae (62) and (82) show that if the only change in the data consists in a
proportional increase of the $n$'s, then the product $S_a^2 (P + Q)$ will remain
unchanged; so will $F(x', y')$. Therefore the only parameter in the equation of the
region of significance boundary which will now have another value is $w$. Table V
at the end of the paper gives $\zeta_{.01} = .9781$ and it follows that $w = 44.66$. Comparing
this with (118), we see that under the present conditions the region of significance
exists. In order to determine its shape we calculate $A/k^2 = 67.072$, and we see
from (117) that the boundary of the region of significance is an ellipse. Substituting
the values of constants calculated into (110), we obtain the equation of this
ellipse as
$$-(1.27609)\,(\xi + 22.472)^2 - (3.14527)\,\eta^2 = -57.793. \qquad (119)$$


The coordinates, x’ and y’, of any point within this ellipse represent the values of 
the percentile rank in the matriculation class and the initial examination score 
for which we should reject the hypothesis tested, H (x’, y’), i.e. for which we should 
state that the difference between average final scores of boys and girls is a real 
difference, not due to random variation. As the value of F(x', y') is in these points
negative, we could assume that if two groups of boys and girls had x and y equal 
respectively to coordinates of points within the ellipse (119), then the girls would 
be likely to score higher at the final examination. 

* However, if we take the level of significance $\alpha = .05$, then $\zeta_{.05} = .9826$, $w = 56.47$ and the
region of significance exists and is bounded by an ellipse.

Let us consider still another example when $n_1$ and $n_2$ are proportionately
increased still more, their sum being equal to 1006, all other frequency constants
of Table II remaining unchanged. Now $n - s = 1000$.

Interpolating in Table V we should find $\zeta_{.01} = .9934$ and thus $w = 150.5$. Comparing
this value with that of $A/k^2 = 67.072$, we conclude that the region of
significance is limited by a hyperbola. The equation of this hyperbola is found
again from (91), i.e.
$$4.75121\,(\xi - 6.0355)^2 - (3.14527)\,\eta^2 = 759.7. \qquad (120)$$

The position is illustrated in the diagram on p. 88.

Shaded areas limited by an ellipse and by a hyperbola represent
the regions of significance, corresponding to the assumptions n − s = 300 and
n − s = 1000 respectively. In the first case the data would be sufficient for the
rejection of certain hypotheses H(x', y') (namely those corresponding to the
points within the ellipse) in favour of the hypothesis assuming that the final
scores of the girls are generally higher. The data would not be sufficient for the
rejection of any hypothesis in favour of the other alternative, assuming that the
scores of boys are better than those of girls.

On the other hand, if n —s were equal to 1000, then the region of significance 
would consist of two parts determining values of x’ and y’ for which alternatively 
boys or girls may be reasonably assumed to score higher at the final examination. 
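The three cases considered (n − s = 220, 300 and 1000) differ only in the value of w, so that the classification by conditions (116) and (117) can be written as a small function. The sketch below is illustrative; the function name and the labels it returns are ours, and the two limits are simply the values quoted in the text.

```python
def region_shape(w, lower_limit, a_over_k2):
    """Classify the region of significance by conditions (116) and (117).

    lower_limit stands for AH / (k^2 (H + xi0^2 A)), a_over_k2 for A / k^2.
    """
    if w < lower_limit:
        return "none"
    if w < a_over_k2:
        return "ellipse"
    if w == a_over_k2:
        return "parabola"
    return "hyperbola"

LOWER_LIMIT = 43.4541   # value quoted in (118)
A_OVER_K2 = 67.072      # value of A/k^2 quoted in the text

# n - s paired with the corresponding w at the .01 level, as given in the text.
shapes = {n_minus_s: region_shape(w, LOWER_LIMIT, A_OVER_K2)
          for n_minus_s, w in [(220, 32.56), (300, 44.66), (1000, 150.5)]}
print(shapes)
```

With w = 56.47, the value quoted in the footnote for the .05 level, the same function returns the ellipse.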

We may now test other hypotheses discussed in the previous section. 

$H\{A_1 = B_1\}$. This is the hypothesis that the partial regression coefficient of the
final examination score on the percentile rank in the high school has the same value
for men and women students. $H\{A_2 = B_2\}$ has a similar meaning, but concerns
the regression coefficient on the initial examination score.

For both hypotheses $S_r^2$ is calculated according to formula (95), which gives
$$S_r^2 = 17507.2, \quad \zeta = S_a^2/S_r^2 = .9972 \quad \text{for } H\{A_1 = B_1\}, \qquad (121)$$
$$S_r^2 = 17711.2, \quad \zeta = S_a^2/S_r^2 = .9857 \quad \text{for } H\{A_2 = B_2\}. \qquad (122)$$
Both values of $\zeta$ exceed the value of $\zeta_{.01}$; therefore, if the probability of an error
in judgment should not exceed .01, we should not risk stating that there is any
difference between the two sexes in partial regression coefficients taken separately.

It may now seem superfluous to consider the hypothesis $H\{(A_1 = B_1)(A_2 = B_2)\}$
assuming that both equalities $A_1 = B_1$ and $A_2 = B_2$ hold good. However, this is
not so. In fact the data may be sufficient to show that there is a difference between
the partial regression coefficients either between $A_1$ and $B_1$ or between $A_2$ and $B_2$,
this without being sufficient to indicate which of the two equalities fails to hold.

$S_r^2$ is calculated from the formula (96) and we have
$$S_r^2 = 17717.9, \quad \zeta = S_a^2/S_r^2 = .9853. \qquad (123)$$
Referring to the supplementary Beta Function Table and entering it with
$r = 2$, $n - s = 220$, we find by an interpolation, similar to that described above, the
value $\zeta_{.01} = .9590$. It follows that there is no reason to reject the hypothesis that
both $A_1 = B_1$ and $A_2 = B_2$.


[Diagram showing regions of significance corresponding to the assumptions
n − s = 300 (ellipse) and n − s = 1000 (hyperbola). Axes: x = high school
percentile rank; y = initial examination score.]


$H\{(A_0 = B_0)(A_1 = B_1)(A_2 = B_2)\}$. This is the hypothesis that the regression
planes of the final examination score on the percentile rank in the high school and
on the initial score are for both sexes identical. It is not superfluous to consider
this hypothesis, as we may have evidence of some difference between the two
regression planes, without being able to state the actual nature of this difference,
whether in some regression coefficient or in the constant term.

$S_r^2$ calculated from (98) is found to be 17868.8. Therefore
$$\zeta = S_a^2/S_r^2 = .9770. \qquad (124)$$
The hypothesis tested is expressed by means of $r = 3$ equations. Therefore the
Beta Function Table should be entered with $r = 3$ and we have $\zeta_{.01} = .9498$.
There is no sufficient reason for rejecting the hypothesis that the two regression
planes are identical.

If we had the information a priori that both $A_1 = B_1$ and $A_2 = B_2$, then instead
of testing the hypothesis $H(x', y')$ or $H\{(A_0 = B_0)(A_1 = B_1)(A_2 = B_2)\}$ we should
test the hypothesis $H\{(A_0 = B_0) \mid (A_1 = B_1)(A_2 = B_2)\}$, assuming that in addition
to the equalities $A_1 = B_1$, $A_2 = B_2$ which we know for certain, we have also $A_0 = B_0$.

When testing this hypothesis $S_a^2$ is equal to $S_r^2$ calculated for the hypothesis
$H\{(A_1 = B_1)(A_2 = B_2)\}$ and $S_r^2$ is identical with that calculated for
$H\{(A_0 = B_0)(A_1 = B_1)(A_2 = B_2)\}$. Therefore
$$\zeta = S_a^2/S_r^2 = .9916. \qquad (125)$$
The number of equations by which the hypothesis tested is expressed is again $r = 1$.
Therefore we have to compare $\zeta$ with $\zeta_{.01} = .9705$ corresponding to $n - s = 222$. It
is seen that, even if we had the a priori knowledge of $A_1$ and $A_2$ being respectively
equal to $B_1$ and $B_2$, the observed divergency between the values of $z$ and $u$
would not be sufficient for the rejection of the hypothesis that, in fact, the two
regression planes are identical.

However, it should be remembered that the lack of sufficient reasons for
rejecting a hypothesis does not necessarily mean a proof that the hypothesis is true.
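The five tests of this section all follow one pattern: form the ratio of the absolute to the relative minimum sum of squares and compare it with the tabulated limit. The Python sketch below collects them; it is an illustration only, the labels are ours, and the critical values are those quoted in the text for the .01 level.

```python
S_A2 = 17457.61  # absolute minimum S_a^2, from (104)

# (label, S_a^2, S_r^2, critical value of zeta at the .01 level as quoted in the text)
cases = [
    ("A1=B1", S_A2, 17507.2, 0.9702),
    ("A2=B2", S_A2, 17711.2, 0.9702),
    ("A1=B1 and A2=B2", S_A2, 17717.9, 0.9590),
    ("A0=B0, A1=B1 and A2=B2", S_A2, 17868.8, 0.9498),
    ("A0=B0 given A1=B1 and A2=B2", 17717.9, 17868.8, 0.9705),
]

results = []
for label, s_a2, s_r2, zeta_crit in cases:
    zeta = s_a2 / s_r2
    # The hypothesis is rejected only when zeta falls below the critical value.
    results.append((label, round(zeta, 4), zeta < zeta_crit))

for row in results:
    print(row)
```

In the last case the absolute minimum is the relative minimum of the weaker hypothesis, which is why the first sum of squares changes there.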

We have thus completed the comparison between the final scores of the two sexes.
This was done assuming that the only preliminary data concerning each individual
were the values of $x$ and $y$. If we had some further data, such for instance as the
social level of each student, then the same method of approach could be followed
to obtain a more delicate comparison. If the number of classes into which all
students could be divided according to their social position is $m$, then probably
in most cases we could allow for this, writing instead of the equations (55) and (56)
$$\mathcal{E}(z) = A_0 + A_1 x + A_2 y + G_i, \qquad (126)$$
$$\mathcal{E}(u) = B_0 + B_1 x + B_2 y + G_i, \qquad (127)$$
where $i = 1, 2, \ldots, m$ means the order number of the particular class to which a
given student belongs, and where
$$\sum_{i=1}^{m} G_i = 0, \qquad (128)$$
$G_i$ being a correction for the specific learning ability of the $i$th social class. The
constant terms $A_0$ and $B_0$ would now refer to the average learning abilities for all
social classes covered by the inquiry. Starting with the above equations (126)
and (127) we could follow exactly the same line of argument as described above.
Besides the hypotheses which have been already considered we could test also
others, e.g. whether the specific learning abilities of students belonging to
different social classes are the same. This hypothesis would be expressed by means
of the $m - 1$ equations
$$G_1 = G_2 = \ldots = G_m. \qquad (129)$$


It will be noticed that the method advanced in this paper covers the field for
which the methods of analysis of variance and covariance have been devised by
R. A. Fisher. In both cases a comparison is made of certain sums of squares, and
as far as the machinery of the tests is concerned, in one case
$$\zeta = S_a^2/S_r^2 \qquad (130)$$
is referred to the Tables of the Incomplete Beta Function with
$$p = \tfrac{1}{2}(n - s) = \tfrac{1}{2} n_2, \qquad q = \tfrac{1}{2} r = \tfrac{1}{2} n_1;$$
in the other case
$$z = \tfrac{1}{2} \log_e \{(S_r^2 - S_a^2)/n_1\} - \tfrac{1}{2} \log_e \{S_a^2/n_2\} \qquad (131)$$
is referred to R. A. Fisher's Tables of 5 per cent. and 1 per cent. limits of $z$ with
degrees of freedom $n_1$ and $n_2$.* There is, however, a considerable difference in
the method of approach. Our method is based upon a single principle, the use of
the likelihood ratio of Neyman and Pearson; in any given problem as soon as the
class of admissible hypotheses and the hypothesis tested have been defined, the
application of this principle leads automatically to a justifiable solution. The
necessity for this preliminary definition appears to us invaluable. Without a
unique principle, the solution of new problems has as it were to be picked out
afresh by intuition, and there have in the past been a number of instances in
which intuition has failed and an inappropriate or inaccurate test has been
employed. This lack of generality in the customary method of approach is
emphasized by the historical subdivision of the method into two sections:
analysis of variance and that of covariance.
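The connection between the two criteria can be made explicit: dividing both sums of squares in (131) by $S_r^2$ gives $z = \tfrac{1}{2}\log_e\{(1 - \zeta)\,n_2 / (\zeta\, n_1)\}$. The Python sketch below evaluates $z$ both ways; it is an illustration only, with function names of our own choosing, and uses the sums of squares behind (124) with $n_1 = r = 3$ and $n_2 = n - s = 220$.

```python
import math

def fisher_z_from_sums(s_a2, s_r2, n1, n2):
    """Fisher's z as in (131), with n1 = r and n2 = n - s degrees of freedom."""
    return 0.5 * math.log((s_r2 - s_a2) / n1) - 0.5 * math.log(s_a2 / n2)

def fisher_z_from_zeta(zeta, n1, n2):
    """The same z written in terms of zeta = S_a^2 / S_r^2."""
    return 0.5 * math.log((1.0 - zeta) / zeta * n2 / n1)

s_a2, s_r2 = 17457.61, 17868.8
z_direct = fisher_z_from_sums(s_a2, s_r2, 3, 220)
z_via_zeta = fisher_z_from_zeta(s_a2 / s_r2, 3, 220)
print(round(z_direct, 4), round(z_via_zeta, 4))
```

Since z is a decreasing function of ζ, small values of ζ correspond to large values of z, so the two tables lead to the same decision.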


IV. SUMMARY OF RESULTS 


A broad class of educational problems has been discussed. It was possible to 
show how these problems may be reduced to those of testing “linear” statistical 
hypotheses. 

Starting with general ideas of testing hypotheses developed by Neyman and
Pearson, and with certain recent results of St. Kołodziejczyk, the problem of
“matched groups” has been discussed and a numerical illustration given.

* Statistical Methods for Research Workers (1935), Table VI.

It was possible to show that the problem of “matched groups” may be
generalized so as to allow both a more detailed analysis of experimental data and
a greater accuracy of results. When treating this more general problem, a new
conception, that of the “region of significance”, has been introduced.

As an application of the theory the problem of sex difference in learning ability 
has been discussed. The numerical results obtained are based upon original data 
collected by Clara Brown and Palmer O. Johnson. 

It will be noticed that the methods developed may be applied not only to
educational problems but also in many other branches of science, such as
agriculture, biology, etc.


V. NUMERICAL DATA 


TABLE III. Brown and Johnson’s data for 145 male students.

Notation: x = percentile rank at high school.
          y = score at initial examination.
          z = score at final examination.

[Table body not reproduced.]


TABLE IV. Brown and Johnson’s data for 81 female students.

Notation: x = percentile rank at high school.
          y = score at initial examination.
          u = score at final examination.

[Table body not reproduced.]

TABLE V. Supplementary Table of the Incomplete Beta Function, giving
the values of $\zeta_\alpha$ for which
$$\frac{1}{B(p, q)} \int_0^{\zeta_\alpha} x^{p-1} (1 - x)^{q-1}\,dx = \alpha.$$

(i) Upper figures correspond to $\alpha = .05$.
(ii) Lower figures correspond to $\alpha = .01$.

                        q=.5    q=1.0   q=1.5   q=2.0   q=2.5   q=3.0   q=3.5   q=4.0
                   t    r=1     r=2     r=3     r=4     r=5     r=6     r=7     r=8

p=50   n−s=100     6   .9621   .9418   .9251   .9103   .8966   .8838   .8717   .8601
                       .9355   .9120   .8932   .8768   .8618   .8480   .8350   .8227

p=60   n−s=120     5   .9683   .9513   .9372   .9246   .9129   .9019   .8915   .8815
                       .9460   .9261   .9101   .8961   .8832   .8713   .8600   .8492

p=75   n−s=150     4   .9746   .9608   .9494   .9391   .9295   .9205   .9119   .9036
                       .9566   .9404   .9274   .9158   .9052   .8953   .8859   .8769

p=100  n−s=200     3   .9809   .9705   .9618   .9539   .9465   .9396   .9329   .9264
                       .9673   .9550   .9450   .9361   .9278   .9201   .9128   .9058

p=150  n−s=300     2   .9872   .9802   .9743   .9690   .9639   .9592   .9546   .9501
                       .9781   .9698   .9629   .9568   .9512   .9458   .9407   .9358

p=300  n−s=600     1   .9936   .9901   .9870   .9843   .9817   .9793   .9769   .9746
                       .9890   .9848   .9812   .9781   .9752   .9725   .9698   .9672

p=∞    n−s=∞       0  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000




VI. SUPPLEMENTARY TABLE OF INCOMPLETE BETA FUNCTION 
Notation. 


$r$ = the number of equations by means of which the hypothesis tested is expressed.

$n$ = number of observed values of the experimental character = number of
individuals.

$s$ = number of independent parameters of which the population mean of the
experimental character is assumed to be a linear function with known
coefficients.

$p = \tfrac{1}{2}(n - s)$, $q = \tfrac{1}{2} r$. These two constants are introduced here in order to indicate
the connection between the present supplementary table and the fundamental
Pearson’s Table of the Incomplete Beta Function, where $p$ and $q$ are used as
independent variables.

$t = 600/(n - s)$ is an auxiliary independent variable, which is introduced only for
purposes of interpolation (see below).


Interpolation.


In order to calculate the value of $\zeta_\alpha$ for any value of $n - s > 100$, start by
calculating the corresponding value of $t = 600/(n - s)$. Then interpolate linearly
between figures in the corresponding column, using $t$ as independent variable.
Note that whatever $\alpha$, the value corresponding to $t = 0$ is equal to 1 (see last line
of the table).


Example.


Let $r = 5$, $n - s = 1000$. Then $t = .6$ and in order to obtain the value of $\zeta_{.05}$ we
have to interpolate between
$\zeta_{.05} = 1.0000$ corresponding to $t = 0$,
$\zeta_{.05} = .9817$ corresponding to $t = 1$,
which gives $\zeta_{.05} = 1.0000 - (.6)(.0183) = .9890$. Similarly we get $\zeta_{.01} = .9851$.

The table has been calculated at the Department of Statistics, University 
College, London, by Miss Florence N. David, who also did all the remaining 
numerical work for the paper. We have much pleasure in expressing to her our 
hearty thanks. 

The method employed in calculating Table V consisted in alternate applica- 
tion of quadratures and interpolation. The authors are aware that it is possible 
to use other tables in connection with the problems treated—for instance 
R. A. Fisher’s z-tables—but the use of the Incomplete Beta Function Table 
seemed to be more convenient. If the z-transformation is used the relation
between $z$ and $\zeta$ is shown above in equations (130) and (131).
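One column of Table V admits a simple independent check. When $r = 2$ we have $q = 1$, the integrand reduces to $x^{p-1}$ and $B(p, 1) = 1/p$, so that the defining equation becomes $\zeta_\alpha^{\,p} = \alpha$, i.e. $\zeta_\alpha = \alpha^{1/p}$ with $p = \tfrac{1}{2}(n - s)$. The Python sketch below compares this closed form with the tabulated figures for r = 2; it is offered as a check only and is not the method by which the table was calculated.

```python
def zeta_alpha_r2(alpha, n_minus_s):
    """Exact zeta_alpha for r = 2 (q = 1): solves x**p = alpha with p = (n - s)/2."""
    p = n_minus_s / 2.0
    return alpha ** (1.0 / p)

# Column r = 2 of Table V as tabulated: n - s -> (alpha = .05, alpha = .01).
table_r2 = {100: (0.9418, 0.9120), 120: (0.9513, 0.9261), 150: (0.9608, 0.9404),
            200: (0.9705, 0.9550), 300: (0.9802, 0.9698), 600: (0.9901, 0.9848)}

# Largest disagreement between the closed form and the tabulated figures.
max_diff = max(abs(zeta_alpha_r2(alpha, m) - tabulated)
               for m, row in table_r2.items()
               for alpha, tabulated in zip((0.05, 0.01), row))
print(max_diff)
```

The largest disagreement is within the rounding of a four-decimal table.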


ON THE ANALYSIS OF k SAMPLES FROM EXPONENTIAL
POPULATIONS WITH ESPECIAL REFERENCE TO THE
PROBLEM OF RANDOM INTERVALS


By P. V. SUKHATME


CONTENTS

PAGE

I. Introductory 94
II. Tests associated with the exponential law 96
(1) Derivation of appropriate criteria 96

(2) Distributions of some auxiliary functions /, i a) a : c ‘ 99 

(3) The moment-coefficients of )-criteria and the distribution of L,=) a LOO 

(4) The distributions of L-criteria for the special case k = 2 103

III. Approximation to the frequency distributions of L, and L, . : : - 105 
IV. Practical illustrations 107 
V. Comparison between the “interval” and the “count” methods of analysis 110
VI. Summary 111 
References 112 


I. INTRODUCTORY 


In applying statistical technique to determine whether a series of events occurs
randomly in time (or space), it has been a common practice to make use of the
fact that if this hypothesis is true the frequency of these events occurring in a
fixed interval of time (or space) will follow a Poisson distribution. In dealing with
events in time, this fact has been turned to good account in several investigations,
among which may be mentioned those of Greenwood and Yule (1)* and Newbold (2)
regarding factory accidents and of Rutherford and Geiger (3) in connection with
the discharge of α particles. A similar conception regarding randomness in space
underlies “Student’s” (4) analysis of counts under the haemacytometer, those of
Matuszewski and Supinska (5) and those of Przyborowski and Wilenski (6).
Another method of attack which has been suggested by various writers, including
Bortkiewicz (7), Morant (8) and Neyman and Pearson (9), is based on the analysis
of the intervals between the events which, if they be random, should be
distributed according to the exponential law. The first method may be described
as that of count analysis, the second as that of interval analysis.

* The figures in brackets refer to the list of publications quoted at the end of this paper.

As in analogous problems where the variation follows the Normal law the
questions for practical investigation are concerned not only with the frequency
distribution of a single long series of data but with an analysis of these data into
subgroups suitably arranged to detect lack of homogeneity. As an illustration of
this point we may consider the data shown in Table I, collected by the London
Telephone Service.* The figures represent the length of intervals between the
arrival of successive calls at the Franklin Exchange from each of several different
exchanges. All records were taken in the “busy period” 11.0 a.m. to 1.0 p.m. and
the unit is half-a-minute.


TABLE I. Length of intervals between successive telephone 
calls in half-a-minute units. 


Exchange Gerrard Trunk calls} Toll calls | Kensington 


Date of record in 
November 1929 | 14 15 


Interval no. (12) | (1) 


1 6 1 2 5 9 
2 1 10 42 | 15 0 
3 16 7 ae: 22 ; 19 2 
zt 19 8 1 119 9 8 
5 13 4 14 12 (33) | 23 34 
6 9 | 26 52 25 — | 15 5 
7 2 | 15 15 23 — | 26 1 
8 1 3 aa 1 | 80 — | 19 5 
5 1 3 9 | (29) | 22 — |17 3 | (52) 
10 28 5 15 | — | 18 — | 33 5 | — 
1l 7 9 24) — (4) — | 15 aa 
12 48 8 2;—}— a 1 Fat i 
13 16Sia 11 3} — | — — | (40) ize | — 
14 3 | 40 (5); 10 | — | — — | — ye) ease 
15 15.58 — 9;/—}— —|— 98 == 
16 2 ] —_— 2;—]— — | — (69) | — 
17 2 ] — | 64] —}] — a Eee ieee 
18 2 | 3i —_ qd); — | — — he eee ae 
19 8 5 —}—]}—] — — | — end MP Tse 
20 1 14 —}—}]—-—)];—-— a Jax eee 
21 1 16 —}|]—}]—-—}]—- — | — ed ge 
22 2 (3) —}/—};—}]— —— aa nigelWa teens 
23 12 | — —{|—}]—] — Ss ip hy | ee 
24 (13) | — —}—}]—]— —|— ae 


Note. The figures in brackets represent time in half-a-minute units between 11 a.m. 
and the first call or between the last call and 1 p.m. 


Questions that might need investigation are as follows:

(a) Have the calls arrived at random in general?

(b) Could the records on different days or at different times of the same day
for a given exchange be combined together without loss of homogeneity?

(c) Did the intensity of traffic at different exchanges differ significantly
on the particular days in question?

* I am indebted to Mr W. F. Newland for kindly putting these data at my disposal.

Again, it is known that for traffic handled on a delayed basis the delay follows
an exponential law if the holding time of a call be assumed to be so distributed (10).
In this case it may be desirable to test different hypotheses concerning the average
holding time and the average delay for a series of calls arriving at different plants.
Further, given a number of plants with the same equipment and similarly situated
with respect to factors such as the intensity of traffic and the average holding
time, it may even be possible to test for the variability of the human element,
such as an operator’s ability to speed up at times of heavy traffic, which enters
into the reckoning of the delay problem.

A similar series of questions arises in connection with accident data. It may be
required to ascertain whether:

(a’) The lengths of interval between consecutive accidents incurred by the 
same individual worker are distributed randomly? 

(b') The worker is more cautious for a certain period after an accident? 

(c’) The liability to accident varies considerably among the workers ? 


The first purpose of this paper is to show how tests may be developed based on 
the distribution of intervals between random events; these tests associated with 
variations following the exponential law are precisely analogous to the tests 
classed by R. A. Fisher under the heading of Analysis of Variance in the case of 
normal law variation (11). Afterwards, the methods will be illustrated on some 
telephone and accident data and the uses of the “interval” method of analysis 
will be compared with those of ‘‘count”’ analysis, the technique of which will 
shortly be discussed elsewhere by the author. 


II. TESTS ASSOCIATED WITH THE EXPONENTIAL LAW 


(1) Derivation of appropriate criteria. 


The probability function of a variable $X$ following the exponential law may be
represented by the equation depending on two parameters
$$p(X) = \frac{1}{\sigma}\, e^{-(X - \beta)/\sigma} \quad \text{for } X \ge \beta, \qquad p(X) = 0 \quad \text{for } X < \beta, \qquad (1)$$
where $\beta$ is the distance of the start of the curve from the origin and $\sigma$ is the standard
deviation and also the distance of the mean from the start, i.e. mean $X = \beta + \sigma$.
In the problems to be considered the samples may have been drawn from a
number, say $k$, of different distributions of the form (1) for which $\beta_t$ and $\sigma_t$
($t = 1, 2, \ldots, k$) differ. A variety of hypotheses may need to be tested, which can be
illustrated on the problems discussed in the preceding section.
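For readers who wish to experiment with the law (1), a minimal Python sketch is given below. The function names are ours and purely illustrative; the sketch merely restates (1), its integral, and the relation mean X = β + σ.

```python
import math

def exp_pdf(x, beta, sigma):
    """Density (1): zero below beta, (1/sigma) exp(-(x - beta)/sigma) from beta onwards."""
    if x < beta:
        return 0.0
    return math.exp(-(x - beta) / sigma) / sigma

def exp_cdf(x, beta, sigma):
    """Probability that X does not exceed x under (1)."""
    if x < beta:
        return 0.0
    return 1.0 - math.exp(-(x - beta) / sigma)

def exp_mean_sd(beta, sigma):
    """Mean and standard deviation of the law (1)."""
    return beta + sigma, sigma
```

For instance, the median of the law lies at β + σ log 2, so that `exp_cdf(beta + sigma * math.log(2), beta, sigma)` is one half.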

In the case of the telephone calls where X is the interval between arrival of
calls, $\beta$ must be zero, since calls can arrive simultaneously. Problem (a) is that of
testing for the randomness of arrivals, allowing for different values of $\sigma_t$, i.e.
examining whether there is justification in using the exponential theory, allowing
at the same time for variation in intensity of traffic on different exchanges or


P. V. SUKHATME 97


different days. Problems (b) and (c) reduce to testing whether there is a common
value of $\sigma_t$ at different times or in different exchanges. In the case of accidents
the problems are similar, but here it is possible that $\beta_t \neq 0$, as there might be a
closed period after one accident before a second will occur, owing perhaps to a
change in conditions of work.

If a large number of observations X are available which could be assumed
homogeneous, the appropriateness of "exponential theory", e.g. the justification
of the assumption of randomness, could be obtained by fitting a curve of the form
(1) to the combined data and applying a test for goodness of fit. If, however, the
observations fall into a number of small groups within which it may be assumed
that there is homogeneity, but between which either $\beta$ or $\sigma$ or both may change,
this method is not available. Neyman and Pearson have, however, suggested a
suitable test which can be used if $\sigma$ changes from group to group provided $\beta$ is
known. The use of this test will be illustrated below; at present it will be assumed
that the exponential law (1) is appropriate.

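Suppose that $k$ samples are available, $\Sigma_1, \Sigma_2, \dots \Sigma_k$; that $\Sigma_t$ contains $n_t$
observations $X_{ti}$ having a mean $\bar{X}_t$ ($t = 1, 2, \dots k$; $i = 1, 2, \dots n_t$) and has been drawn from
a population $\Pi_t$ for which

$$p(X) = \frac{1}{\sigma_t}\, e^{-\frac{X - \beta_t}{\sigma_t}}.$$

It follows that the probability function of $X_{ti}$ ($t = 1, 2, \dots k$; $i = 1, 2, \dots n_t$) is given
by, say,

$$p(X_{11}, \dots X_{k n_k}) = \Bigl(\prod_{t=1}^{k} \sigma_t^{-n_t}\Bigr)\, e^{-\sum_{t=1}^{k} \frac{n_t (\bar{X}_t - \beta_t)}{\sigma_t}}. \tag{2}$$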


In general the following three hypotheses will be considered here, which will
be denoted by $H_0$, $H_1$ and $H_2$ respectively.

(1) The hypothesis $H_0$ that the $k$ samples ($\Sigma_1, \Sigma_2, \dots \Sigma_k$) have been randomly
drawn from the same exponential population, i.e. that $\Pi_1, \Pi_2, \dots \Pi_k$ are identical.
In other words it is desired to test whether

$$\beta_1 = \beta_2 = \dots = \beta_k \tag{3}$$
and
$$\sigma_1 = \sigma_2 = \dots = \sigma_k. \tag{4}$$

(2) The hypothesis $H_1$ that the samples have come from populations with the
same $\sigma$, but with $\beta_t$ having any different values whatsoever; that is to say, it is
wished to test whether (4) is true, irrespective of what happens to (3).

(3) The hypothesis $H_2$: here it is assumed as known that besides the populations
being exponential, the $\sigma_t$ are identical, i.e. (4) is true. The hypothesis to be
tested is then that (3) is true.

The choice of the criteria will be based on the use of the likelihood ratio as 
suggested by Neyman and Pearson (9,12). In the case of Normal variation this 


method leads to the usual Analysis of Variance tests and to certain new tests of 




98 Analysis of k Samples from Exponential Population 


practical value. The method consists in defining (a) the set $\Omega$ of admissible
populations $\Pi_t$ from which the $k$ samples may have been drawn, and (b) the subset
$\omega$ to which the populations must belong if the hypothesis be true. Next, maximum
values are found for the $p$ of equation (2) associated with $k$ populations, (a) from
$\Omega$, (b) from $\omega$. Then the criterion to be used in the test is

$$\lambda = \frac{p(\omega \text{ max})}{p(\Omega \text{ max})}. \tag{5}$$


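The details of the method in deriving the criteria appropriate to $H_0$, $H_1$ and $H_2$
are similar to those given by Neyman and Pearson (13,14), Pearson and Wilks (15),
Kołodziejczyk (16) and Welch (17) and will not be given here. It will be found that

$$\lambda_{H_0} = \frac{\prod_{t=1}^{k} (\bar{X}_t - X_{t1})^{n_t}}{(\bar{X}_0 - X_{a1})^{N}}
= \prod_{t=1}^{k} \left(\frac{l_t}{l_a}\right)^{n_t} \text{ (say)}, \tag{6}$$

where $X_{t1}$ denotes the lowest observation and $\bar{X}_t$ the mean in the $t$th sample;
$X_{a1}$ denotes the smallest of the $k$ lowest observations $X_{t1}$; and
$\bar{X}_0 = \frac{1}{N} \sum_{t=1}^{k} n_t \bar{X}_t$, where $N = \sum_{t=1}^{k} n_t$, i.e. $\bar{X}_0$ is the mean of the
whole $N$ observations. Further, $l_t = \bar{X}_t - X_{t1}$, i.e. is the distance between the mean
and the lowest observation for the $t$th sample; and $l_a = \bar{X}_0 - X_{a1}$, i.e. the distance
between the mean and the lowest of all the $N$ observations. For the hypothesis $H_1$

$$\lambda_{H_1} = \prod_{t=1}^{k} \left(\frac{l_t}{l}\right)^{n_t}, \tag{7}$$

where $l$ denotes the weighted mean of $l_1, l_2, \dots l_k$. Finally for $H_2$

$$\lambda_{H_2} = \left(\frac{\bar{X}_0 - \frac{1}{N}\sum_{t=1}^{k} n_t X_{t1}}{\bar{X}_0 - X_{a1}}\right)^{N}
= \left(\frac{l}{l_a}\right)^{N}. \tag{8}$$

It will be noticed that

$$\lambda_{H_0} = \lambda_{H_1} \times \lambda_{H_2}. \tag{9}$$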

It is worth while to note the analogy of these tests with the corresponding tests
for Normal variation; the lowest observation and the distance between the mean
and the lowest observation play here the rôle which was played by the mean and
standard deviation in the case of Normal variation.*


* The analogy appears to be promising in getting a number of interesting results. Some of these, 
such as a system of functions analogous to a set of orthogonal functions used in the Analysis of 
Variance tests and others, will shortly be discussed elsewhere. 
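The criteria (6) to (8) are easily computed from the lowest observation and the mean of each sample. The following is a minimal sketch, assuming the reading of the criteria as products and ratios of $l_t$, $l$ and $l_a$; the function name and the intervals used are made up for illustration and are not data from this paper. It returns the $1/N$th powers of the three criteria and shows that the product relation (9) holds for them.

```python
import math


def l_criteria(samples):
    """Return (L0, L1, L2), the 1/N-th powers of the lambda criteria (6)-(8)."""
    n = [len(s) for s in samples]
    N = sum(n)
    means = [sum(s) / len(s) for s in samples]
    lows = [min(s) for s in samples]
    l_t = [m - lo for m, lo in zip(means, lows)]          # mean minus lowest, per sample
    l_bar = sum(nt * lt for nt, lt in zip(n, l_t)) / N     # weighted mean of the l_t
    grand_mean = sum(nt * m for nt, m in zip(n, means)) / N
    l_a = grand_mean - min(lows)                           # grand mean minus overall lowest
    L0 = math.exp(sum(nt * math.log(lt / l_a) for nt, lt in zip(n, l_t)) / N)
    L1 = math.exp(sum(nt * math.log(lt / l_bar) for nt, lt in zip(n, l_t)) / N)
    L2 = l_bar / l_a
    return L0, L1, L2


# Made-up illustrative intervals (not data from the paper).
samples = [[0.8, 2.1, 0.3, 1.7], [1.2, 0.4, 3.0, 0.9, 2.2], [0.6, 1.5, 0.2]]
L0, L1, L2 = l_criteria(samples)
print(L0, L1, L2, L1 * L2)   # the last value should reproduce L0
```

Working with logarithms avoids forming the $N$th powers directly, which would underflow for samples of any size.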


(2) Distributions of some auxiliary functions $l_t$, $l$ and $v$.


Having obtained the criteria, the next purpose is to obtain their sampling
distributions, or at any rate the moment-coefficients of these calculated under
the assumption that the hypothesis tested be true. For this purpose it is essential
to know the frequency distributions of some functions $l_t$, $l$ and $v$ defined below.


FIG. 1. DIAGRAM ILLUSTRATING THE NOTATION. (The population curve, with the sample
data shown as dots and the position of the sample mean marked.)


Change the origin to the start of the population curve; writing $x_{ti} = X_{ti} - \beta$,
$\bar{x}_t$ and $x_{t1}$ will now be the coordinates of the mean and the lowest observation in the
$t$th sample. Further, $x_{a1}$ will be the smallest of the $k$ values of $x_{t1}$ and $\bar{x}_{\cdot 1}$ their
weighted mean, namely $\frac{1}{N} \sum_{t=1}^{k} (n_t x_{t1})$. The notation will be clear from the adjoining
diagram.

Write

$$l = \frac{n_1 l_1 + n_2 l_2 + \dots + n_k l_k}{N}, \qquad
v = \frac{1}{N} \sum_{t=1}^{k} (n_t x_{t1}) - x_{a1}. \tag{10}$$




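The distributions of $l_t$ and $l$. Starting with equation (58) on p. 224 (9), which
gives the frequency function of $x_{t1}$ and $l_t$, namely

$$p(x_{t1}, l_t) = \left(\frac{n_t}{\sigma}\right)^{n_t} \frac{1}{\Gamma(n_t - 1)}\, l_t^{\,n_t - 2}\, e^{-\frac{n_t}{\sigma}(l_t + x_{t1})}, \tag{11}$$

we have on integrating with respect to $x_{t1}$ between the limits 0 and infinity,

$$p(l_t) = \left(\frac{n_t}{\sigma}\right)^{n_t - 1} \frac{1}{\Gamma(n_t - 1)}\, l_t^{\,n_t - 2}\, e^{-\frac{n_t}{\sigma} l_t}, \tag{12}$$

a Pearson type III curve.

Adopting the results of A. E. R. Church (18) for the distribution of means in
repeated samples of $n$ from a Pearson type III curve, we have, further,

$$p(l) = \text{constant}\; l^{\,N-k-1}\, e^{-\frac{N}{\sigma} l}. \tag{13}$$

The distribution of $v$. Since

$$p(x_{t1}) = \frac{n_t}{\sigma}\, e^{-\frac{n_t}{\sigma} x_{t1}}, \tag{14}$$

an exponential curve with standard deviation $\sigma / n_t$ (9), the joint probability function
of the $k$ lowest observations is given by

$$p(x_{11}, x_{21}, \dots x_{k1}) = \text{constant}\; e^{-\frac{N}{\sigma} \bar{x}_{\cdot 1}}, \tag{15}$$

where $\bar{x}_{\cdot 1}$ denotes the weighted mean of the $k$ lowest observations.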
Substitutin, k 

2 = X41, 2g = yy, Zp = Uys o0- Se Uy 
and observing that for a given set of values of 2, 2, 24, --. 2 the variable z, 


cannot be less than z, and takes its highest value when x,,=2,,;=2, (on the 
assumption that z,, is the smallest of all the k lowest observations), we have on 
integrating for z, 
= (124 + Ng% + Ngzy + MgZq+-.. +ny2)} . 
(17) 
Integrating in this way successively for all the variables z, we have, on writing 
back 2) =k2%, and z,=2},, 


y 
——-Z% 
P (Zo; 21> 24> ++» 2) =Constante ke {5 


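$$p(\bar{x}_{\cdot 1}, x_{a1}) = \text{constant}\; (\bar{x}_{\cdot 1} - x_{a1})^{k-2}\, e^{-\frac{N}{\sigma} \bar{x}_{\cdot 1}}, \tag{18}$$

where $\bar{x}_{\cdot 1}$ is the weighted mean and $x_{a1}$ the smallest of the $k$ lowest observations.
Finally, substituting for $v = \bar{x}_{\cdot 1} - x_{a1}$ and integrating with respect to $x_{a1}$ between the
limits 0 and $\infty$, we have

$$p(v) = \text{constant}\; v^{\,k-2}\, e^{-\frac{N}{\sigma} v}. \tag{19}$$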
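The type III forms (13) and (19) imply that $l$ and $v$ have means $\sigma (N-k)/N$ and $\sigma (k-1)/N$ respectively. The following is a rough simulation check, offered as a sketch only: the sample sizes, the value of $\sigma$, the seed and the function name are assumptions made for illustration, and $\beta$ is taken as zero.

```python
import random


def simulate_l_and_v(sizes, sigma, reps, rng):
    """Average l and v over repeated sets of k samples drawn under H0 (beta = 0)."""
    N = sum(sizes)
    sum_l = sum_v = 0.0
    for _ in range(reps):
        l = 0.0
        weighted_low = 0.0
        lows = []
        for n_t in sizes:
            xs = [sigma * rng.expovariate(1.0) for _ in range(n_t)]
            low = min(xs)
            l += sum(xs) - n_t * low          # this is n_t * l_t
            weighted_low += n_t * low
            lows.append(low)
        sum_l += l / N                         # weighted mean of the l_t
        sum_v += weighted_low / N - min(lows)  # weighted mean of lowest minus smallest
    return sum_l / reps, sum_v / reps


sizes, sigma = (4, 6, 5), 2.0
mean_l, mean_v = simulate_l_and_v(sizes, sigma, reps=20_000, rng=random.Random(7))
N, k = sum(sizes), len(sizes)
print(mean_l, sigma * (N - k) / N)   # simulated mean of l beside the value implied by (13)
print(mean_v, sigma * (k - 1) / N)   # simulated mean of v beside the value implied by (19)
```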
(3) The moment-coefficients of $\lambda$-criteria and the distribution of $L_2 = \lambda_{H_2}^{1/N}$.

Neyman and Pearson (14) have found it convenient to study the sampling
distributions of some fractional power of the $\lambda$'s, rather than that of the $\lambda$'s
themselves, owing to the extreme skewness of the latter distributions. In the
present case it will be found for similar reasons that some advantage will be
gained by using the $1/N$th power of the $\lambda$'s. Further, in keeping with their
notation, let $L_0$, $L_1$ and $L_2$ denote $\lambda_{H_0}^{1/N}$, $\lambda_{H_1}^{1/N}$ and $\lambda_{H_2}^{1/N}$ respectively.
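(a) The case of $\lambda_{H_0}^{1/N}$ or $L_0$. If the hypothesis $H_0$ be true, then the $k$ samples
would be drawn from a common population with start at $\beta$ and standard deviation
$\sigma$; so that the frequency function for simultaneous variation in $x_{t1}$ and $l_t$ ($t = 1, 2, \dots k$)
may be written as

$$p\{l_1, l_2, \dots l_k, x_{11}, x_{21}, \dots x_{k1}\}
= \text{constant} \prod_{t=1}^{k} (l_t)^{n_t - 2}\, e^{-\frac{n_t}{\sigma}(l_t + x_{t1})}
= \text{constant} \Bigl[\prod_{t=1}^{k} (l_t)^{n_t - 2}\Bigr] e^{-\frac{N}{\sigma} l} \times e^{-\frac{N}{\sigma} \bar{x}_{\cdot 1}}, \tag{20}$$

where $l$ is the weighted mean of the $l_t$ and $\bar{x}_{\cdot 1}$ that of the lowest observations $x_{t1}$.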
The next purpose is to transform the variables and integrate out for certain of
them until we are left with the frequency function for $k$ new variables $\xi_1, \xi_2, \dots \xi_k$
defined in equation (23) below. $L_0$ can be expressed easily in terms of these
variables, and the values of its moments can then be readily obtained.

It is clear from equation (20) that the frequency function is written as the
product of two independent functions. The latter function, $e^{-\frac{N}{\sigma} \bar{x}_{\cdot 1}}$, corresponds
exactly to that given in equation (15). Hence applying the methods of the
preceding section, we have

$$p\{l_1, l_2, \dots l_k, \bar{x}_{\cdot 1}, x_{a1}\}
= \text{constant} \Bigl[\prod_{t=1}^{k} (l_t)^{n_t - 2}\Bigr] e^{-\frac{N}{\sigma} l} \times (\bar{x}_{\cdot 1} - x_{a1})^{k-2}\, e^{-\frac{N}{\sigma} \bar{x}_{\cdot 1}}. \tag{21}$$


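Substituting further, $l_a = l + \bar{x}_{\cdot 1} - x_{a1}$, we have, on integrating with respect to $x_{a1}$
between the limits 0 and $\infty$,

$$p\{l_1, l_2, \dots l_k, l_a\} = \text{constant} \Bigl[\prod_{t=1}^{k} (l_t)^{n_t - 2}\Bigr] (l_a - l)^{k-2}\, e^{-\frac{N}{\sigma} l_a}. \tag{22}$$

Finally transform the variables

$$\xi_1 = \frac{l_1}{l_a}, \quad \xi_2 = \frac{l_2}{l_a}, \quad \dots \quad \xi_k = \frac{l_k}{l_a}, \quad l_a' = l_a. \tag{23}$$

Since the Jacobian of the transformation is $l_a^{\,k}$, we have, after integrating with
respect to $l_a'$ between the limits 0 and $\infty$,

$$p(\xi_1, \dots \xi_k) = \text{constant} \prod_{t=1}^{k} \xi_t^{\,n_t - 2}\, (1 - \bar{\xi})^{k-2}, \tag{24}$$

where $\bar{\xi}$ denotes the weighted mean of the $k$ variables $\xi_1, \dots \xi_k$ defined above.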
Now it follows from equation (6) that $L_0$ can be written as

$$L_0 = \prod_{t=1}^{k} \xi_t^{\,n_t / N}, \tag{25}$$

where $\xi_t = l_t / l_a$. Denoting by $\mu_p'$ the $p$th moment-coefficient of $L_0$ about 0, we have

$$\mu_p' = \int_0^1 L_0^{\,p}\, p(L_0)\, dL_0
= \text{constant} \int \!\dots\! \int \prod_{t=1}^{k} \Bigl(\xi_t^{\,n_t - 2 + \frac{p n_t}{N}}\Bigr) (1 - \bar{\xi})^{k-2}\, d\xi_1 \dots d\xi_k, \tag{26}$$

where the region of integration is extended over all positive values of the variables
such that $\bar{\xi} < 1$.

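Now since it could be shown that ((19), p. 252)

$$\int \!\dots\! \int x_1^{a_1 - 1} x_2^{a_2 - 1} \dots x_k^{a_k - 1}\, f(x_1 + x_2 + \dots + x_k)\, dx_1 \dots dx_k
= \frac{\Gamma(a_1)\, \Gamma(a_2) \dots \Gamma(a_k)}{\Gamma(a_1 + a_2 + \dots + a_k)} \int_0^1 f(\tau)\, \tau^{a_1 + a_2 + \dots + a_k - 1}\, d\tau, \tag{27}$$

the integration on the left being extended over all positive values of the variables
whose sum does not exceed 1, we obtain, on evaluating the constant in (26) by
making $\mu_0' = 1$,

$$\mu_p' = \frac{\Gamma(N - 1)}{\Gamma(N + p - 1)} \prod_{t=1}^{k} \left(\frac{N}{n_t}\right)^{\frac{p n_t}{N}}
\frac{\Gamma\!\left(n_t - 1 + \frac{p n_t}{N}\right)}{\Gamma(n_t - 1)}. \tag{28}$$

A special case of importance occurs when $n_1 = \dots = n_k = n$; then

$$\mu_p' = k^p\, \frac{\Gamma(N - 1)}{\Gamma(N + p - 1)} \left[\frac{\Gamma\!\left(n - 1 + \frac{p}{k}\right)}{\Gamma(n - 1)}\right]^{k}. \tag{29}$$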
ep Deva Pte) | en 
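(b) The case of $L_1$. It is obvious that the $p$th moment of $L_1$ about zero will be
given by

$$\mu_p' = \int_0^1 L_1^{\,p}\, p(L_1)\, dL_1
= \text{constant} \int_0^\infty \!\dots\! \int_0^\infty \left(\frac{\prod_{t=1}^{k} l_t^{\,n_t / N}}{l}\right)^{p}
\prod_{t=1}^{k} l_t^{\,n_t - 2}\, e^{-\frac{n_t}{\sigma} l_t}\; dl_1 \dots dl_k. \tag{30}$$

Substituting

$$\frac{n_t l_t}{\sigma} = z_t^2 \quad (t = 1, 2, \dots k), \tag{31}$$

$$\mu_p' = \text{constant} \prod_{t=1}^{k} \left(\frac{N}{n_t}\right)^{\frac{p n_t}{N}}
\int_0^\infty \!\dots\! \int_0^\infty \Bigl(\sum_{t=1}^{k} z_t^2\Bigr)^{-p}
\prod_{t=1}^{k} z_t^{\,2 n_t - 3 + \frac{2 p n_t}{N}}\, e^{-\sum_{t=1}^{k} z_t^2}\; dz_1 \dots dz_k. \tag{32}$$

Further, let

$$z_1 = r \prod_{i=1}^{k-1} \cos \phi_i \quad \text{and} \quad
z_j = r \sin \phi_{j-1} \prod_{i=j}^{k-1} \cos \phi_i \ \text{ for } j = 2, 3, \dots k; \tag{33}$$

then since

$$\sum_{j=1}^{k} z_j^2 = r^2 \tag{34}$$

and

$$\frac{\partial (z_1, z_2, \dots z_k)}{\partial (r, \phi_1, \dots \phi_{k-1})} = r^{k-1} \prod_{i=1}^{k-1} \cos^{i-1} \phi_i, \tag{35}$$

we get now, on integrating with respect to $r$ between the limits 0 and $\infty$ and
evaluating the constant by putting $\mu_0' = 1$,

$$\mu_p' = \frac{\Gamma(N - k)}{\Gamma(N - k + p)} \prod_{t=1}^{k} \left(\frac{N}{n_t}\right)^{\frac{p n_t}{N}}
\frac{\Gamma\!\left(n_t - 1 + \frac{p n_t}{N}\right)}{\Gamma(n_t - 1)}. \tag{36}$$

For the special case $n_1 = n_2 = \dots = n_k = n$, we have

$$\mu_p' = k^p\, \frac{\Gamma(N - k)}{\Gamma(N - k + p)} \left[\frac{\Gamma\!\left(n - 1 + \frac{p}{k}\right)}{\Gamma(n - 1)}\right]^{k}. \tag{37}$$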


(c) The case of $L_2$. The moments of $L_2$ would be more readily obtained from
the distribution of $L_2$ itself, which can be simply deduced as follows:

It is easy to see from equation (21) that

$$p(l, v) = \text{constant}\; l^{\,N-k-1}\, e^{-\frac{N}{\sigma} l}\; v^{\,k-2}\, e^{-\frac{N}{\sigma} v}. \tag{38}$$

Substituting $w = l / v$ and observing that $l$ and $v$ are independent, we have, on
integrating with respect to $v$ between the limits 0 and $\infty$,

$$p(w) = \text{constant}\; w^{N-k-1} (1 + w)^{-(N-1)}. \tag{39}$$

It will be noticed from (8) that

$$L_2 = \frac{w}{1 + w}. \tag{40}$$

Hence

$$p(L_2) = \frac{\Gamma(N - 1)}{\Gamma(N - k)\, \Gamma(k - 1)}\, L_2^{\,N-k-1} (1 - L_2)^{k-2}, \tag{41}$$

and consequently

$$\mu_p'(L_2) = \frac{\Gamma(N - 1)\, \Gamma(N + p - k)}{\Gamma(N - k)\, \Gamma(N + p - 1)}. \tag{42}$$

It will be noticed from equations (28), (36) and (42) that

$$\mu_p'(L_0) = \mu_p'(L_1) \times \mu_p'(L_2). \tag{43}$$

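(4) The distributions of $L$-criteria for the special case $k = 2$.

(a) The distribution of $L_0$. Starting from equation (24), we have, for $k = 2$,

$$p(\xi_1, \xi_2) = \text{constant}\; \xi_1^{\,n_1 - 2}\, \xi_2^{\,n_2 - 2}. \tag{44}$$

To obtain the distribution of $L_0$ we must substitute for $\xi_2$ from

$$L_0 = \xi_1^{\,n_1 / N}\, \xi_2^{\,n_2 / N}$$

and integrate (44) with respect to $\xi_1$ subject to the condition that

$$\frac{n_1 \xi_1}{N} + \frac{n_2 \xi_2}{N} < 1. \tag{45}$$

Denoting by $\xi_1'$ and $\xi_1''$ the roots of the equation

$$n_1 \xi_1 + n_2 \left(\frac{L_0^{\,N}}{\xi_1^{\,n_1}}\right)^{\frac{1}{n_2}} = N, \tag{46}$$

we have

$$p(L_0) = \text{constant}\; L_0^{\frac{N (n_2 - 1)}{n_2} - 1} \int_{\xi_1'}^{\xi_1''} \xi_1^{\frac{n_1}{n_2} - 2}\, d\xi_1. \tag{47}$$

When $n_1 = n_2$, equation (47) turns out to be

$$p(L_0) = \text{constant}\; L_0^{\,2n-3} \cosh^{-1}\!\left(\frac{1}{L_0}\right). \tag{48}$$

The integral representing $P\{L_0 < L_0'\}$ for any fixed value $L_0'$ could be calculated
directly from (44). The formulae are particularly simple when $n_1 = n_2 = n$. Then
we have

$$P\{L_0 < L_0'\} = \int_0^{L_0'} p(L_0)\, dL_0
= \frac{\Gamma(n - \tfrac12)}{\sqrt{\pi}\, \Gamma(n - 1)} \left\{ 2 L_0'^{\,2n-2} \cosh^{-1}\!\left(\frac{1}{L_0'}\right)
+ \int_0^{L_0'^2} y^{n-2} (1 - y)^{-\frac12}\, dy \right\}. \tag{49}$$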

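(b) The distribution of $L_1$. Starting with the frequency function of $l_1$ and $l_2$,

$$p(l_1, l_2) = \text{constant}\; l_1^{\,n_1 - 2}\, l_2^{\,n_2 - 2}\, e^{-\frac{1}{\sigma}(n_1 l_1 + n_2 l_2)}, \tag{50}$$

and introducing new variables $l$ and $t$,

$$n_1 l_1 = N l t, \qquad n_2 l_2 = N l (1 - t), \tag{51}$$

where $0 < l$ and $0 < t < 1$, we get

$$p(l, t) = \text{constant}\; l^{\,N-3}\, e^{-\frac{N l}{\sigma}}\; t^{\,n_1 - 2} (1 - t)^{n_2 - 2}, \tag{52}$$

and consequently

$$p(t) = \text{constant}\; t^{\,n_1 - 2} (1 - t)^{n_2 - 2} = B^{-1}(n_1 - 1, n_2 - 1)\; t^{\,n_1 - 2} (1 - t)^{n_2 - 2}. \tag{53}$$

Now $L_1$ can be expressed as a function of $t$ only,

$$L_1 = N \left(\frac{t}{n_1}\right)^{\frac{n_1}{N}} \left(\frac{1 - t}{n_2}\right)^{\frac{n_2}{N}}, \tag{54}$$

and it follows that the probability integral of $L_1$ can be obtained from (53).
Denoting by $P\{L_1 > L_1'\}$ the probability of $L_1$ exceeding any given number
$L_1' < 1$, we have

$$P\{L_1 > L_1'\} = \frac{\int_{t_1}^{t_2} t^{\,n_1 - 2} (1 - t)^{n_2 - 2}\, dt}{B(n_1 - 1, n_2 - 1)}, \tag{55}$$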




where $t_1$ and $t_2$ are the solutions of the equation (54), in the left-hand side of which
we have to substitute $L_1'$ instead of $L_1$.

In the case $n_1 = n_2 = n$ (say) the solution is particularly easy, as

$$t_1 = \tfrac12 \bigl(1 - \sqrt{1 - L_1'^2}\bigr), \qquad t_2 = \tfrac12 \bigl(1 + \sqrt{1 - L_1'^2}\bigr). \tag{56}$$

Substituting in (55),

$$t = \tfrac12 (1 + \sqrt{u}), \tag{57}$$

we get

$$P\{L_1 > L_1'\} = \frac{\int_0^{1 - L_1'^2} (1 - u)^{n-2}\, u^{-\frac12}\, du}{B(n - 1, \tfrac12)}. \tag{58}$$

Differentiating this formula with regard to $L_1'$ and changing the sign, we obtain
the elementary probability law of $L_1$,

$$p(L_1) = \frac{2}{B(n - 1, \tfrac12)}\, L_1^{\,2n-3} (1 - L_1^2)^{-\frac12}. \tag{59}$$

Hence it follows that $p(L_1^2)$ is of Pearson type I form.*

Knowing the distributions of $L_1$ and $L_2$ it will be noticed that the distribution
of $L_0$ can be deduced by making use of the results (9) and (43). Take the simple
case $n_1 = n_2$.

Since we know from (41) and (59) that

$$p(L_1, L_2) = \text{constant}\; L_1^{\,2n-3} (1 - L_1^2)^{-\frac12}\, L_2^{\,2n-3}, \tag{60}$$

we have, on substituting $L_2 = L_0 / L_1$ in (60) and integrating it with respect to $L_1$
between the limits $L_0$ and 1,

$$p(L_0) = \text{constant}\; L_0^{\,2n-3} \log \frac{1 + \sqrt{1 - L_0^2}}{L_0}
= \text{constant}\; L_0^{\,2n-3} \cosh^{-1}\!\left(\frac{1}{L_0}\right), \tag{61}$$

which agrees with (48) as expected.
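The agreement between the density of $L_0$ for $k = 2$ and the general moment formula can also be checked numerically. The following is a sketch under stated assumptions: the density is taken as proportional to $L_0^{2n-3} \cosh^{-1}(1/L_0)$, the moments are taken from (29) with $k = 2$, and a plain midpoint rule (my choice, not the author's) is used for the integrals. The function names are illustrative.

```python
import math


def density_shape(L, n):
    """Unnormalised density of L0 for k = 2 and n1 = n2 = n, as in (48) and (61)."""
    return L ** (2 * n - 3) * math.acosh(1.0 / L)


def midpoint_integral(f, a, b, steps=100_000):
    """Midpoint rule; the end points, where acosh(1/L) is awkward, are never evaluated."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def moment_from_density(p, n):
    norm = midpoint_integral(lambda L: density_shape(L, n), 0.0, 1.0)
    return midpoint_integral(lambda L: L ** p * density_shape(L, n), 0.0, 1.0) / norm


def moment_from_formula(p, n):
    """Equation (29) with k = 2, so that N = 2n."""
    N = 2 * n
    return math.exp(p * math.log(2) + math.lgamma(N - 1) - math.lgamma(N + p - 1)
                    + 2 * (math.lgamma(n - 1 + p / 2) - math.lgamma(n - 1)))


for n in (2, 5):
    print(n, moment_from_density(1, n), moment_from_formula(1, n))
```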


III. APPROXIMATION TO THE FREQUENCY DISTRIBUTIONS OF $L_0$ AND $L_1$


In order to make the use of $L$ tests available in practice it is necessary to obtain
suitable approximations to their distributions. It is seen from equation (41) that
$L_2$ is distributed in a Pearson type I curve, and since $L_0$ and $L_1$ lie between 0 and 1,
it seems reasonable to assume that their distributions may be represented
approximately by a similar law. Thus it will be assumed that approximately

$$p(L) = \frac{\Gamma(m_1 + m_2)}{\Gamma(m_1)\, \Gamma(m_2)}\, L^{m_1 - 1} (1 - L)^{m_2 - 1}, \tag{62}$$

* I would like to take this opportunity of correcting here a statement on p. 591 of my paper (20).
The substitution (34) and the derivation therefrom of (35) are not legitimate, since $L_1$ is not
monotonic with regard to $\beta$. When $n_1 = n_2$, $L_1$ is a symmetrical function of $\beta$ and equation (36),
therefore, remains valid.




where $m_1$ and $m_2$ may be expressed in terms of the first two moment-coefficients
of $L$ about zero, as follows:

$$m_1 = \frac{\mu_1' (\mu_1' - \mu_2')}{\mu_2' - \mu_1'^2}, \qquad
m_2 = \frac{(1 - \mu_1')(\mu_1' - \mu_2')}{\mu_2' - \mu_1'^2}. \tag{63}$$

This is the form of approximation suggested by Neyman and Pearson in the 
case of Normal variation (14). It has since been investigated by Nayer for the $L_1$
test (21). It is also found satisfactory in the case of their L, test (k= 2), the true 
sampling distribution of which is known (22). It is, therefore, likely to prove 
adequate in our case, especially for the $L_1$ test, because the $p$th moment of the
Neyman and Pearson $L_1$ test becomes identical with the expression for $\mu_p'$
given in equation (37) by simply varying the sample size $n$.

A further test is to compare the third and fourth moments of the approxima- 
tions (62) with the corresponding true moments obtained from (29) and (37). 
Tables II and III give this information in the form of a comparison of the values 




TABLE II. Frequency constants for the distribution of $L_0$.

True values from equations (29):

 n |  k | Mean $L_0$ | S.D. $L_0$ | $\beta_1$ | $\beta_2$
 5 | 10 |   .7268    |   .0627    |   .0779   |  3.0000
15 |  5 |   .9191    |   .0314    |   .4896   |  3.6478


Values for 
distribution (63) 
Be 
3°1951 


5-343 


3115 


1-810 


of $\beta$'s in certain cases; the correspondence suggests that the probability integral
of (62) will give an adequate approximation to the unknown true value. The
chance that $L < L'$ will then be given approximately by

$$P\{L < L'\} = I_{L'}(m_1, m_2), \tag{64}$$

which may be obtained from the Tables of the Incomplete Beta Function (23) or by
means of R. A. Fisher's z-transformation as suggested by E. S. Pearson (24).
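The fitting of the type I curve by its first two moments, and the comparison of higher moments described above, may be sketched as follows. This is an illustration only: it assumes the moment formula (37) for $L_1$ with equal sample sizes and the relations (63), and the function names and the case $n = 5$, $k = 10$ are my own choices. The printed pairs show the true third and fourth moments beside those of the fitted curve.

```python
import math


def moment_L1_equal(p, n, k):
    """Equation (37): p-th moment of L1 for k samples of common size n."""
    N = n * k
    return math.exp(p * math.log(k) + math.lgamma(N - k) - math.lgamma(N - k + p)
                    + k * (math.lgamma(n - 1 + p / k) - math.lgamma(n - 1)))


def fit_type_I(mu1, mu2):
    """Equation (63): m1 and m2 of the curve (62) from the first two moments about zero."""
    common = (mu1 - mu2) / (mu2 - mu1 * mu1)
    return mu1 * common, (1.0 - mu1) * common


def beta_moment(p, m1, m2):
    """p-th moment about zero of the curve (62), for a positive integer p."""
    out = 1.0
    for j in range(p):
        out *= (m1 + j) / (m1 + m2 + j)
    return out


n, k = 5, 10
mu = [moment_L1_equal(p, n, k) for p in (1, 2, 3, 4)]
m1, m2 = fit_type_I(mu[0], mu[1])
for p in (3, 4):
    print(p, mu[p - 1], beta_moment(p, m1, m2))   # true moment beside fitted moment
```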

In practice it is often sufficient to know the values of $L$'s corresponding to
$P = 0.05$ and $P = 0.01$. In this connection, it is useful to note that Nayer's
5 per cent. and 1 per cent. tables of Neyman and Pearson's $L_1$ test (21) could
be utilized to obtain corresponding levels for $L_1$ in my case by entering with
$f = 2n - 2$. That this is the case will be seen by comparing my equation (37)
with his equation (5) (p. 39 above), and remembering that $N = nk$ and in his
notation $f = n - 1$.




IV. PRACTICAL ILLUSTRATIONS


The main purpose of this section is to illustrate briefly how the tests developed 
in the preceding pages may be applied in practice. The example in connection 
with telephone calls referred to in Section I provides a good illustration of the 
type of problem we are considering. 

A first point would be to examine whether calls do, in fact, arrive randomly in
time. This is a necessary preliminary before using any exponential tests at all to
analyse the data. For this purpose Neyman and Pearson have suggested the use
of the following test for random intervals (9), which consists in calculating a
quantity $g$ defined by*

$$g = \frac{X_{t1} - \beta}{\bar{X}_t - X_{t1}}, \tag{65}$$

where $X_{t1}$ is the lowest observation and $\bar{X}_t$ the mean of the sample, for each
sample and comparing the resulting distribution of $g$ with its theoretical
frequency law, namely

$$p(g) = (n - 1)(1 + g)^{-n}. \tag{66}$$

It is to be noticed that the value of $\beta$ in $g$ is common to all the $k$ samples. It
must be given as a part of the hypothesis to be tested or based on a priori
considerations. If there is agreement between the observed and the theoretical
distributions of $g$, then not only is it probable that the hypothesis as to $\beta$ is correct,
but also that the intervals (less $\beta$) are random.
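The calculation of $g$ for a single sample may be sketched as follows. The intervals shown are made up for illustration and are not taken from the telephone data; the function names are likewise assumptions. The second function integrates the frequency law (66) from 0 to $g$, which gives $1 - (1 + g)^{-(n-1)}$.

```python
def g_statistic(sample, beta=0.0):
    """Equation (65): (lowest observation - beta) / (mean - lowest observation)."""
    low = min(sample)
    mean = sum(sample) / len(sample)
    return (low - beta) / (mean - low)


def g_cdf(g, n):
    """Chance that the statistic falls below g, from integrating (66)."""
    return 1.0 - (1.0 + g) ** (-(n - 1))


intervals = [3.2, 0.9, 5.1, 2.4]          # made-up intervals, n = 4
g = g_statistic(intervals)
print(g, g_cdf(g, len(intervals)))
```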

Table IV compares the observed distributions of $g$ with the theoretical one for
the two cases (1) $n = 4$, (2) $n = 8$. The value of $\beta$ is assumed to be zero, since calls
may arrive simultaneously. The intervals chosen were the first four and first
eight for each exchange. Though the total frequency is very small in either case,
the close agreement suggests that the intervals between arrival of calls were
random.


TABLE IV.

n = 4

g | Frequency observed | Frequency expected


Having found that the hypothesis of randomness is satisfied, we might wish to 
consider problems (b) and (c) given on p. 95. We shall only consider one of these 
here, say (c), namely, Does the intensity of traffic vary significantly at different 


* $g$ is there $z''$. Actually of course for a final establishment of this test we ought to know how
sensitive it is to departures from the exponential law.




exchanges on the same day and hour? For this purpose it is necessary that the
records from several exchanges we wish to compare should be taken on the same
day and hour. In the data we are analysing this correspondence in time holds good
for (i) the Gerrard and Central calls, (ii) the Monitor, Trunk and Toll calls.

It has been shown in Section I that the problem of comparing the intensity of
traffic is equivalent to testing the hypothesis $H_1$ that there is a common value of
$\sigma_t$ for the exchanges we wish to compare. Thus to compare the intensity at the
Gerrard and Central stations on the 14th of November, we calculate the value of
$L_1$ from columns (2) and (3) of Table I and compare it with the corresponding
5 per cent. level of significance. Table V makes this comparison for four different
cases. For convenience, only the special case of samples of equal size is
considered here. The $L_1$ 5 per cent. values were obtained from Nayer's tables of
Neyman and Pearson's $L_1$ test (21), entering with $f = 2n - 2$.* With the possible
exception of the second case, it appears that the intensity of traffic was much the
same at the stations compared, but it must of course be remembered that the
data, while useful for illustrative purposes, are too scanty for any thorough
investigation.
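For two samples of equal size the 5 per cent. level of $L_1$ need not be taken from tables, since the probability integral of $L_1$ is then known exactly. The following is a sketch only, assuming the form $P\{L_1 > L_1'\} = \int_0^{1 - L_1'^2} (1-u)^{n-2} u^{-1/2}\, du \,/\, B(n-1, \tfrac12)$ obtained in Section II; the substitution $u = s^2$, the use of Simpson's rule and of bisection, and the function names are my own choices. For $n = 13$, a sample size assumed here purely for illustration, the level found is about .92.

```python
import math


def prob_L1_exceeds(c, n, steps=2000):
    """P{L1 > c} for k = 2 and n1 = n2 = n; u = s*s removes the end-point singularity."""
    upper = math.sqrt(1.0 - c * c)
    h = upper / steps
    f = lambda s: 2.0 * (1.0 - s * s) ** (n - 2)
    total = f(0.0) + f(upper)
    for i in range(1, steps):                 # Simpson's rule on [0, upper]
        total += (4 if i % 2 else 2) * f(i * h)
    integral = total * h / 3.0
    beta_fn = math.exp(math.lgamma(n - 1) + math.lgamma(0.5) - math.lgamma(n - 0.5))
    return integral / beta_fn


def lower_level(n, alpha=0.05):
    """Value c with P{L1 < c} = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if 1.0 - prob_L1_exceeds(mid, n) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(round(lower_level(13), 3))
```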


TABLE V.

No. | Exchanges                     | Date     | k | n  | Observed $L_1$ | 5% level
2   | Gerrard and Central calls     | 15.xi.30 | 2 | …  | .92…           | .922
3   | Monitor, Trunk and Toll calls | 21.xi.30 | 3 | 4  | .908           | .700
4   | Monitor, Trunk and Toll calls | 22.xi.30 | 3 | 12 | .991           | .912


In considering the simple case of samples of equal size we are virtually wasting
some information contained in the data. The methods developed in the preceding
pages apply equally well to samples of all sizes, equal or unequal. With this point
in view let us reconsider the case 2 in Table V. Twenty-two calls are received from
Gerrard and 14 from the Central between 11 a.m. and 1 p.m. on the 15th of
November, providing us with 21 and 13 intervals respectively. The observed
value of $L_1$ is now .969; while the 5 per cent. level obtained by using the
approximate method of the preceding section is .940. It appears, therefore, that the
intensity of traffic at Gerrard on the 15th of November was not significantly
different from that at Central Exchange.

Let us next consider the problem of factory accidents referred to in Section I.
The problem is discussed at length in Report 34 of the Industrial Fatigue Research
Board, where Miss Newbold has analysed a large amount of data by the "count"
method of analysis. It is not here intended to enter into the details of the problem,
but to illustrate briefly how the interval method of analysis might be employed
to investigate several aspects of the problem.

The data analysed in Report 34 supply, for each of a large number of workers,


* See p. 51 above for these tables. 




the information regarding the time and type of accident incurred by the worker 
during the year under observation. The data, since they were collected only for 
one year, contain a large number of workers with 0, 1 or 2 accidents. Such cases 
cannot be treated by the interval method of analysis because there must be at 
least three accidents during the period to give a sample of two intervals—the 
minimum that can be dealt with by the present method. Indeed, if the problem
of accidents is to be dealt with effectively in this way it is necessary that the
records should be kept over a sufficiently long period.

One of the principal aims of Miss Newbold was to investigate whether liability
to accident varied considerably among a group of workers. It has been shown
in Section I that, provided the randomness of accidents in time for any one
individual is established, then the problem is one of testing the hypothesis, $H_1$,
that there is a common average interval between accidents for each individual.
For the sake of illustration, I have made use of some of the original data* used
by Miss Newbold, taking those cards of the series MII (of p. 11 of her Report 34)
which contain five or more accidents.

Before proceeding with the $H_1$ hypothesis, it is necessary to examine whether
the intervals between accidents were, in fact, random. In Table VI, I have given
the values of $g$ for the 125 workers who incurred at least five accidents and the

TABLE VI. Intervals between accidents. Distribution of g.

              n = 4                                n = 8
g            | Observation | Theory |  g            | Observation | Theory
0.0-0.1      |    30.0     |  31.1  |  0.0-0.1      |     19      |  24.3
0.1-0.2      |    23.0     |  21.6  |  0.1-0.2      |     16      |  11.7
0.2-0.3      |    15.0     |  15.4  |  0.2-0.3      |      7      |   6.0
0.3-0.4      |    16.0     |  11.3  |  0.3-0.4      |      4      |   3.2
0.4-0.5      |    11.5     |   8.5  |  0.4-0.5      |      2      |   1.8
0.5-1.0      |    16.0     |  21.4  |  > 0.5        |      2      |   3.0
1.0-1.5      |     4.5     |   7.6  |               |             |
1.5-2.0      |     3.5     |   3.4  |               |             |
2.0-3.0      |     4.5     |   2.7  |               |             |
> 3.0        |     1.0     |   2.0  |               |             |
Total        |   125.0     | 125.0  |  Total        |     50.0    |
$\chi^2$     |     5.87    |        |  $\chi^2$     |     3…      |
$P(\chi^2)$  |     .555    |        |  $P(\chi^2)$  |     …       |
Observed $L_1$ |   .885    |        |  Observed $L_1$ |   …       |
$L_1$ 5%     |     .809    |        |  $L_1$ 5%     |     …       |

50 workers who incurred nine or more. The accident intervals chosen were, therefore,
the first four and the first eight of the year respectively. The value of $\beta$ was
taken to be zero, i.e. it was assumed that there was no closed interval between
accidents. The table compares the observed frequencies with the theoretical
obtained by integrating (66) between appropriate limits. The values of $\chi^2$ and

* I am indebted to Professor M. Greenwood, F.R.S., for kindly allowing me to make use of some
of these original data.




$P(\chi^2)$ are given. Judged by these, the agreement between the observed and the
expected series appears satisfactory in both cases (1) $n = 4$, (2) $n = 8$; this suggests,
therefore, that the intervals between accidents were random.
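The theoretical frequencies of Table VI may be recomputed from the frequency law of $g$. The following is a minimal sketch, assuming that law to be $p(g) = (n-1)(1+g)^{-n}$, so that the chance of exceeding $g$ is $(1+g)^{-(n-1)}$; the cell boundaries are read from the table and the function name is illustrative. The values obtained agree with the "Theory" columns to within about 0.1.

```python
def expected_frequencies(edges, n, total):
    """Expected counts of g in the cells bounded by `edges`, the last cell being open-ended."""
    survival = lambda g: (1.0 + g) ** (-(n - 1))   # chance that the statistic exceeds g
    cells = []
    for left, right in zip(edges, edges[1:]):
        cells.append(total * (survival(left) - survival(right)))
    cells.append(total * survival(edges[-1]))        # open cell above the last edge
    return cells


edges_4 = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0, 3.0]
edges_8 = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
theory_4 = expected_frequencies(edges_4, 4, 125)
theory_8 = expected_frequencies(edges_8, 8, 50)
print([round(e, 1) for e in theory_4])
print([round(e, 1) for e in theory_8])
```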

At the lower end of the table are compared the observed values of $L_1$ with the
corresponding 5 per cent. levels. In neither case is $L_1$ exceptional, which means
that there is no evidence of a significant difference in the average interval between
accidents for different individuals. In other words, as far as can be judged from
these data, all the 125 workers seem to be liable to accident more or less to the
same degree.

It will be noticed that the purpose in choosing only the first four and the first
eight intervals for each worker was to simplify the process of testing the $H_1$
hypothesis. In a comprehensive inquiry, it may seem advisable to calculate a
single value of $L_1$ by considering all the available intervals for each worker; but
I believe that in the case of large numbers of samples, a detailed analysis for a
few values of $n$, after the manner shown above, would be equally efficient.

To those who have read Miss Newbold’s paper (2), it may appear strange to read 
my conclusion on the homogeneity in accidents among the workers; for Miss 
Newbold’s investigations have left no doubt about their heterogeneity within 
each group. A little consideration, however, suggests that there is no real con- 
tradiction between the two; for, while Miss Newbold examined a group as a whole, 
I have analysed only that part of a group which consists of workers who have 
incurred at least five accidents during the year. 


V. COMPARISON BETWEEN THE "INTERVAL" AND THE
"COUNT" METHODS OF ANALYSIS


The second method of approach starts from the fact that if an event occurs
randomly in time or space and the variable considered is the number of occurrences
counted in a fixed time or space interval, then the chances that this variable takes
values 0, 1, 2, ... x, ... are given by the terms of the Poisson Series:

$$e^{-m} \left(1,\ m,\ \frac{m^2}{2!},\ \dots\ \frac{m^x}{x!},\ \dots\right).$$

Thus in the case of accidents, if these occur randomly for each of $k$ workers, the
chance that the $t$th worker incurs $x_t$ accidents in a fixed interval of time would
be $e^{-m_t}\, \frac{m_t^{x_t}}{x_t!}$ ($t = 1, 2, \dots k$). If there is a common accident liability, then

$$m_1 = m_2 = \dots = m_k; \text{ and } x_1, x_2, \dots x_k$$

can be regarded as a sample from some common Poisson distribution. Thus the
problem of testing the hypothesis of homogeneity in accidents is equivalent to
the problem of testing whether $k$ single observations, each of which is assumed
to be drawn from some Poisson distribution, have in fact come from a common
distribution.




On the other hand we might approach the problem in yet another way. It is
known that if a number is the sum of several components each of which is
independently distributed in a Poisson Series, then the total number is also so
distributed. Consequently, if we choose some convenient unit of time so that
there will be $n$ units of time contained in the period of observation, then the total
number of accidents incurred by each worker may be regarded as the sum of $n$
components each of which is distributed independently in a Poisson Series; and
the problem of testing for liability becomes equivalent to testing whether $k$
samples, each of $n$ observations, have in fact been drawn from some common
Poisson distribution. It may be remarked here that both the methods of approach
lead to an identical test of the hypothesis, namely Aj, (22). 
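The additive property of the Poisson Series used above may be verified numerically. The following is a small sketch, not part of the original argument: the function names and the values $n = 12$ units, each with mean 0.25, are assumptions made for illustration. The distribution of the total is built by repeated convolution and compared with the single Poisson Series having mean $12 \times 0.25 = 3$.

```python
import math


def poisson_pmf(x, m):
    """Term of the Poisson series: e^-m * m^x / x!."""
    return math.exp(-m) * m ** x / math.factorial(x)


def pmf_of_sum(n, m_unit, max_x):
    """Distribution of the sum of n independent Poisson counts, each with mean m_unit,
    built by repeated convolution and reported for totals 0..max_x."""
    unit = [poisson_pmf(x, m_unit) for x in range(max_x + 1)]
    total = [1.0] + [0.0] * max_x            # distribution of an empty sum
    for _ in range(n):
        total = [sum(total[j] * unit[x - j] for j in range(x + 1))
                 for x in range(max_x + 1)]
    return total


summed = pmf_of_sum(12, 0.25, 10)
for x in range(4):
    print(x, summed[x], poisson_pmf(x, 3.0))
```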

It is, however, important to note the meaning of the assumption that each of
the $k$ samples has come from some unknown Poisson distribution. For it is
conceivable that liability to accident may change with age and experience of a
worker during the period under observation and hence the $n$ observations for
each worker may not be supposed to have come from some common unknown
Poisson distribution. The assumption may, however, be readily tested by means
of the A,, test (22). 

The problem on telephone calls may be approached in exactly the same way.
It is not my intention to work out here the details of this method of approach in
connection with these problems, since the method of analysis and its application
to different problems will be given in another paper which will soon be ready for
publication. There it will be found that, like the "interval" method of analysis,
the "count" method is also of very wide application and perhaps even of wider
application in certain respects save in one important case where it is not
altogether available; for problems where there is necessarily a "closed period" must
be treated by the "interval" method of analysis.


VI. SUMMARY


A statistical technique analogous to that given by Neyman and Pearson in the
case of Normal law of variation (14) has been developed for the exponential
population.

This method, which has been called the method of "interval" analysis, is
illustrated on several problems in connection with intervals between random
events which are known to follow an exponential law of variation.

Finally, the approach of the alternative method of "count" analysis to the
problems of the present paper is described and the uses of the two methods briefly
compared.

In conclusion, I desire to express my thanks to Professor E. S. Pearson and
Dr J. Neyman for their advice and criticism.




REFERENCES 

(1) M. Greenwood and G. U. Yule (1920). Journ. Royal Stat. Soc. vol. lxxxiii.
(2) E. M. Newbold (1925). Industrial Fatigue Research Board Report 34.
(3) Rutherford and Geiger (1910). Phil. Mag. vol. xx, p. 698.
(4) "Student" (1907). Biometrika, vol. v, pp. 351-360.
(5) T. Matuszewski and J. Supińska (1933). Medycyna Doświadczalna i Społeczna, vol. xvi. Warsaw.
(6) J. Przyborowski and H. Wileński (1935). Biometrika, vol. xxvii.
(7) L. Bortkiewicz. Bull. Inst. Int. Stat. vol. xx, 2e Livr.
(8) G. M. Morant (1920). Biometrika, vol. xiii, pp. 309-337.
(9) J. Neyman and E. S. Pearson (1928). Biometrika, vol. xxa, pp. 175-240.
(10) E. C. Molina (1927). Bell Telephone Laboratory Publication. Reprint B. 263. New York.
(11) R. A. Fisher (1935). Statistical Methods for Research Workers.
(12) J. Neyman and E. S. Pearson (1933). Phil. Trans. Royal Soc. A, vol. ccxxxi, pp. 289-337.
(13) J. Neyman and E. S. Pearson (1930). Bull. de l'Académie Polonaise des Sciences et Lettres, Série A.
(14) J. Neyman and E. S. Pearson (1931). Bull. de l'Académie Polonaise des Sciences et Lettres, Série A.
(15) E. S. Pearson and S. S. Wilks (1933). Biometrika, vol. xxv, pp. 353-378.
(16) St. Kołodziejczyk (1935). Biometrika, vol. xxvii, pp. 161-190.
(17) B. L. Welch (1935). Biometrika, vol. xxvii, pp. 145-160.
(18) A. E. R. Church (1926). Biometrika, vol. xviii, pp. 321-394.
(19) E. T. Whittaker and G. N. Watson. Modern Analysis.
(20) P. V. Sukhatme (1935). Proceedings of the Indian Academy of Sciences, vol. ii, no. 6, pp. 584-604.
(21) P. P. N. Nayer. This volume, pp. 38-51.
(22) P. V. Sukhatme (an unpublished essay).
(23) K. Pearson (1934). Tables of the Incomplete Beta Function.
(24) E. S. Pearson (1931). Biometrika, vol. xxiv, p. 415.


SUFFICIENT STATISTICS AND UNIFORMLY MOST POWERFUL
TESTS OF STATISTICAL HYPOTHESES


By J. NEYMAN AND E. S. PEARSON


CONTENTS

1. Introductory
2. Notation and terminology
3. The testing of statistical hypotheses
4. Definitions and properties of sufficient statistics
5. The existence of uniformly most powerful tests when the alternative hypotheses
   depend upon more than one independent parameter
6. Conclusions to be drawn from the existence of a uniformly most powerful test
7. Further illustrative examples
8. Conclusions


1. Introductory. 


In the following paper we have found it necessary to employ methods of 
reasoning which may appear of rather unfamiliar character to many statisticians. 
The question, however, as to what conclusions regarding sufficient statistics may 
be drawn from the existence of uniformly most powerful tests, or vice versa, is 
essentially one which concerns the properties of functions representing the 
probability laws. It is inevitable, therefore, that a paper dealing with this
problem should bear some mark of the theory of functions, in spite of its concern
with a statistical question. 


2. Notation and terminology. 

The probability of an event, $E$, will be denoted by $P\{E\}$. The relative
probability of $E_2$, given $E_1$, or the probability of $E_2$ calculated on the
assumption that $E_1$ has occurred, will be denoted by $P\{E_2 \mid E_1\}$.

$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ will denote the elementary
probability law of random variables $x_1, \ldots, x_n$, depending upon $l$ known or
unknown parameters $\theta_1, \ldots, \theta_l$. We shall consider the case only
where the variables $x_1, \ldots, x_n$ are continuous, i.e. such that whatever be
$x_1', \ldots, x_n'$, $P\{(x_1 = x_1'), \ldots, (x_n = x_n')\} = 0$.

$E$ will be used to denote a point, termed the sample point, in the $n$-dimensional
space, say $W$, having $x_1, \ldots, x_n$ for its coordinates. $W$ will be termed the
sample space. Let $w$ be any region in $W$. The probability law
$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ will be defined by the following
property: whatever the region $w$, the probability
$P\{E \in w \mid \theta_1, \ldots, \theta_l\}$, that $E$ will fall within $w$, is given
by the integral, in the sense of Riemann, of
$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ taken over the whole region $w$, or

$$P\{E \in w \mid \theta_1, \ldots, \theta_l\} = \int \cdots \int_w p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)\, dx_1 \ldots dx_n. \tag{1}$$




This definition implies that $p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ is a
continuous function in almost the whole space* $W$, and non-negative in any point in
which it is continuous. We may assume that it is never negative. The function
$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ is determined by (1), except
perhaps in a set of points of measure zero where it may be arbitrary. We may however
determine it in the whole sample space by assuming it to be continuous wherever the
condition (1) permits and zero elsewhere. With this convention, whenever
$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ is positive, it will be
continuous with regard to $x_1, \ldots, x_n$.

The expression $p(x_k, \ldots, x_n \mid x_1, \ldots, x_{k-1})$ will denote the relative
elementary probability law of $x_k, \ldots, x_n$ given $x_1, \ldots, x_{k-1}$, or in
other words the elementary probability law of $x_k, \ldots, x_n$ calculated on the
assumption that $x_1, \ldots, x_{k-1}$ have some definite values.

It is easy to establish the following identities:

$$\text{(a)} \quad p(x_1, \ldots, x_n) = p(x_1, \ldots, x_{k-1})\, p(x_k, \ldots, x_n \mid x_1, \ldots, x_{k-1}); \tag{2}$$

$$\text{(b)} \quad p(x_1, \ldots, x_{k-1}) = \int \cdots \int p(x_1, \ldots, x_n)\, dx_k \ldots dx_n, \tag{3}$$

where the integral extends from $-\infty$ to $+\infty$ for all variables
$x_k, \ldots, x_n$.

If the equations

$$y_i = y_i(x_1, \ldots, x_n) \quad (i = 1, 2, \ldots, n) \tag{4}$$

form a one-to-one transformation of the space $W$ of the $x$’s in the space $W'$ of
the $y$’s, if the partial derivatives $\partial x_i / \partial y_j$
($i, j = 1, \ldots, n$) exist and are continuous and if

$$\Delta = \frac{\partial (x_1, \ldots, x_n)}{\partial (y_1, \ldots, y_n)} \tag{5}$$

does not change its sign and differs from zero in $W'$ (except perhaps in a set of
points of measure zero), then

$$p(y_1, \ldots, y_n) = p(x_1, \ldots, x_n)\, |\Delta|, \tag{6}$$

where there should be substituted on the right-hand side of (6), instead of the
$x$’s, their expressions in terms of the $y$’s obtained from (4).
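
As a numerical illustration of (6), not taken from the original text, the following Python sketch assumes two independent variables each following the law $e^{-x}$ for $x > 0$, and the transformation $y_1 = x_1 + x_2$, $y_2 = x_1/(x_1 + x_2)$. Both choices, and all function names, are assumptions made for the example. The determinant (5) is estimated by central differences, and the resulting law of the $y$’s can be compared with the value $y_1 e^{-y_1}$ obtained by hand.

```python
import math

def p_x(x1, x2):
    """Assumed joint law of the x's: two independent unit exponentials."""
    return math.exp(-(x1 + x2)) if x1 > 0 and x2 > 0 else 0.0

def x_from_y(y1, y2):
    """The x's expressed in terms of the y's for y1 = x1 + x2, y2 = x1 / (x1 + x2)."""
    return y1 * y2, y1 * (1.0 - y2)

def jacobian(y1, y2, h=1e-6):
    """Central-difference estimate of the determinant d(x1, x2) / d(y1, y2), as in (5)."""
    a1, a2 = x_from_y(y1 + h, y2)
    b1, b2 = x_from_y(y1 - h, y2)
    c1, c2 = x_from_y(y1, y2 + h)
    d1, d2 = x_from_y(y1, y2 - h)
    dx1_dy1, dx2_dy1 = (a1 - b1) / (2 * h), (a2 - b2) / (2 * h)
    dx1_dy2, dx2_dy2 = (c1 - d1) / (2 * h), (c2 - d2) / (2 * h)
    return dx1_dy1 * dx2_dy2 - dx1_dy2 * dx2_dy1

def p_y(y1, y2):
    """Law of the y's obtained from formula (6)."""
    x1, x2 = x_from_y(y1, y2)
    return p_x(x1, x2) * abs(jacobian(y1, y2))

# Compare formula (6) with the hand-derived value y1 * exp(-y1).
print(p_y(1.5, 0.3), 1.5 * math.exp(-1.5))
```

In this assumed case the determinant equals $-y_1$, so that it keeps a constant sign for $y_1 > 0$, as the conditions stated above require.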


* I.e. except perhaps in a set of points of measure zero.

† The above terminology, e.g. sample point and sample space, etc., may suggest that the $x$’s
we consider denote necessarily independent observations of one or more random variables. In
such case $p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$ would be a product of several
functions of similar form and differing only in their arguments.

This is a limitation, which is frequently assumed. However, the nature of the problems considered
below does not require this limitation in any way and the reasoning which follows applies equally
in cases where the $x$’s are mutually dependent or not.


3. The testing of statistical hypotheses.


Any assumption concerning the probability law of a set of variables is called
a statistical hypothesis. This is called simple if it specifies the probability law
completely; otherwise it is called composite. Any test of a statistical hypothesis
may be regarded as a rule for the rejection of the hypothesis tested when the
observed sample point falls within a specified region, $w$, called the critical region,
and for its acceptance in other cases.*


In testing statistical hypotheses errors of two kinds may be made: 
(1) We may reject the hypothesis tested when it is true. 
(2) We may accept it when it is false. 


The probability of the first kind of error determined by the hypothesis tested,
say $H$, has been called the size of the corresponding critical region,† and is given
by $P\{E \in w \mid H\}$. Two tests based on critical regions of the same size have been
called equivalent. The probability of rejecting $H$ when the true hypothesis is an
alternative simple hypothesis, say $H'$, or $P\{E \in w \mid H'\}$, has been termed the
power of the test with regard to $H'$. The most powerful test for $H$ with regard to
$H'$ is the test whose power is greater than that of any other equivalent test, and the
critical region yielding the most powerful test with regard to $H'$ has been termed
the best critical region for $H$ with regard to $H'$.

If we denote by $\Omega$ the set of all simple hypotheses which are considered
admissible, then the test of a hypothesis, $H_0$, has been called uniformly most
powerful with regard to $\Omega$, if it is the most powerful test with regard to every
hypothesis included in $\Omega$, alternative to $H_0$.

Write $p(x_1, \ldots, x_n \mid H_0)$ and $p(x_1, \ldots, x_n \mid H')$ for the
probability laws of the $x$’s determined by two simple hypotheses, $H_0$, that which is
tested, and $H'$, an alternative. In recent papers we have discussed the problem of the
most powerful tests of statistical hypotheses,‡ and have shown that the region $w_0$
defined by the inequality

$$p(x_1, \ldots, x_n \mid H') \geq k\, p(x_1, \ldots, x_n \mid H_0), \tag{7}$$

where $k > 0$ is a constant chosen so that

$$P\{E \in w_0 \mid H_0\} = \alpha \tag{8}$$

is the best critical region with regard to $H'$ having size $\alpha$.

The existence of a uniformly most powerful test with regard to the whole class 
of alternatives depends on the inequality (7). If after substituting the proper 
value of k determined from (8), the inequality is independent of H’, then the 
region defined by (7) is the common best critical region with regard to the whole 
class of alternatives. But it may happen that (7) depends on H’, and in this case 
no uniformly most powerful test exists. We have given examples of both situa- 
tions. 
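
The two situations may be sketched numerically. The example below is an assumption made for illustration and is not quoted from the papers cited: $n$ independent normal observations with unit variance, the hypothesis tested specifying a mean of 0 and a simple alternative specifying a mean `mu1`. In this case (7) reduces to a one-sided condition on the sample mean, the side depending only on the sign of `mu1`. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def best_critical_region(mu1, n, alpha):
    """Best critical region for H0: mean 0 against the simple alternative mean mu1.

    Assumes n independent normal observations with unit variance, so that the
    inequality (7) reduces to a one-sided condition on the sample mean.
    Returns ("upper", c) for the region mean >= c, or ("lower", c) for mean <= c.
    """
    if mu1 == 0:
        raise ValueError("the alternative must differ from the hypothesis tested")
    c = NormalDist().inv_cdf(1.0 - alpha) / sqrt(n)
    return ("upper", c) if mu1 > 0 else ("lower", -c)

def power(mu1, n, alpha):
    """Probability that the sample mean falls in the best region when the mean is mu1."""
    side, c = best_critical_region(mu1, n, alpha)
    law_of_mean = NormalDist(mu=mu1, sigma=1.0 / sqrt(n))
    return 1.0 - law_of_mean.cdf(c) if side == "upper" else law_of_mean.cdf(c)

# Alternatives with mu1 > 0 share one region; an alternative with mu1 < 0 gives another.
print(best_critical_region(0.5, 9, 0.05) == best_critical_region(2.0, 9, 0.05))
print(best_critical_region(-0.5, 9, 0.05)[0])
```

If the admissible alternatives are restricted to positive means the region is common to all of them. If means of both signs are admitted, the region defined by (7) depends on the alternative, and in this assumed case no common best critical region exists.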

* Sometimes it is true we may decide to remain in doubt, and in this case the rule will indicate
a division of $W$ into three instead of two regions.

† J. Neyman and E. S. Pearson, Phil. Trans. Roy. Soc. A, vol. ccxxxi (1933), p. 289; Proc. Camb.
Phil. Soc. vol. xxix, part 4 (1933), p. 492.

‡ Loc. cit.

The problem of the most powerful tests, in the case where all admissible hypotheses
are associated with a probability law of the $x$’s of the same form, differing
only in the values of certain parameters involved in the probability law, has been
discussed by R. A. Fisher,* who connected it with the theory of the so-called
“sufficient” statistics. He stated that in cases where a uniformly most powerful
test exists:


(1) A sufficient statistic must exist and be constant on the boundary of the 
corresponding best critical region. 


(2) The number $l$ of independent parameters which are specified by the
alternative hypotheses must be one only.


In the present paper we propose to discuss the same question rather more fully. 
It would appear that the original conception of a sufficient statistic needs some 
extension and classification, and that Fisher’s argument must be modified, since 
his statements just quoted, if taken in full generality, appear to be inexact. 

We should like to emphasize at this point that we are not concerned in the 
discussion which follows with the use of sufficient statistics in problems of 
estimation, but rather with their bearing on the theory of testing statistical 
hypotheses. In treating this latter problem we have found it necessary to use not 
only the conception of sufficient statistics as introduced by R. A. Fisher, but to 
introduce also some new conceptions which, as far as we are aware, have not been 
considered before, namely the conceptions of a sufficient set of statistics and of a 
shared sufficient statistic. Again, we are not concerned with the bearing on 
Fisher’s theory of estimation of the new functions which we define. 

We believe that our definition of a “specific sufficient statistic” corresponds
to Fisher’s conception of a sufficient statistic, but though he has written on
sufficient statistics in several places the definitions he has given appear, in our
opinion, to leave some room for misunderstanding. In his paper “On the mathematical
foundations of theoretical statistics”† there is a section headed “Definitions”
in which the following is found:

Sufficiency. A statistic satisfies the criterion of sufficiency when no other statistic which can be 
calculated from the sample provides any additional information as to the value of the parameter 
estimated. 

This definition contains the term “information” which is not previously
defined. Later, however, in the same paper (p. 316) Fisher writes:

The complete criterion suggested by our work on the mean square error‡ is:


That the statistic chosen should summarize the whole of the relevant information supplied 
by the sample. 
This may be called the Criterion of Sufficiency. 
In mathematical language we may interpret this statement by saying that if $\theta$ be the
parameter to be estimated, $\theta_1$ a statistic which contains the whole of the information
as to the value of $\theta$, which the sample supplies, and $\theta_2$ any other statistic, then
the surface of distribution of pairs of values of $\theta_1$ and $\theta_2$, for a given value of
$\theta$, is such that for a given value of $\theta_1$, the distribution of $\theta_2$ does not
involve $\theta$. In other words, when $\theta_1$ is known, knowledge of the value $\theta_2$
throws no further light upon the value of $\theta$.

* Proc. Roy. Soc. A, vol. cxliv (1934), p. 285.

† R. A. Fisher, Phil. Trans. Roy. Soc. A, vol. ccxxii (1922), p. 309.

‡ R. A. Fisher, “A mathematical examination of the methods of determining the accuracy of
an observation by the mean error and by the mean square error”, Monthly Notices of R.A.S.
vol. lxxx (1920), p. 758.

In our notation $\theta_1$ and $\theta_2$ are $T_1$ and $T_2$. This we have regarded as
Fisher’s mathematical definition of a sufficient statistic and we believe that it is
equivalent to our Definition II.

We have hitherto defined the sample space, $W$, as the whole space of the $x$’s,
including that is to say every point specified by a system of real values of
$x_1, \ldots, x_n$. It will, however, be convenient in what follows to use the same
expression in a somewhat narrower sense. The space $W$ to which the following
propositions relate will be defined as the set of points in which the probability law
$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$, as determined by at least one of
the admissible hypotheses, is not zero. It will be seen that with this definition the
sample space may be limited. In fact, it will not contain points, say
$x_1', \ldots, x_n'$, in which $p(x_1', \ldots, x_n' \mid \theta_1, \ldots, \theta_l) = 0$
for every set of values of the $\theta$’s specified by the different admissible
hypotheses.


4. Definitions and properties of sufficient statistics.

We shall start by defining what we shall term a statistic and more particularly 
a sufficient statistic. We shall need to consider sufficient statistics of different 
classes, the first of which is, we believe, the class originally defined by Fisher. 


Definition I. If a function, $T$, of random variables $x_1, \ldots, x_n$ possesses the
following properties:

(a) $T$ is defined and single valued at almost every point of the sample space $W$,

(b) whatever be a number $T'$, the locus of points in the sample space in which
$T < T'$ is such that the probability law of the $x$’s may be integrated over it,
giving the probability $P\{T < T'\}$,*

(c) there exist such values, $T'$, that the locus of points, $W(T')$, in which
$T = T'$ is of at least $(n-1)$ dimensions, i.e. one less than the number of
dimensions of the sample space $W$,

(d) $T$ does not depend upon any unknown parameters which may be involved
in the probability law,

then it will be called a statistic.

* Such functions are called “measurable”.

In formulating this definition we have tried to cover the case of statistics
which are, at present, in common use. Frequently they are continuous functions
possessing derivatives of all orders at all sample points. Such, for instance, are
the mean $\bar{x} = \sum_{i=1}^{n} x_i / n$ and the variance
$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$. On the other hand, “Student’s” ratio
$z = \bar{x}/s$ is not continuous and has no sense at any point of the line
$x_1 = x_2 = \ldots = x_n$. This shows that it would not be reasonable to apply the
term statistic to continuous functions only. It may, however, be useful to require
continuity and differentiability almost everywhere in $W$, but we do not insist
on this point.

We have thought it necessary to require that condition (b) should be fulfilled, 
since statistics for which it is impossible to calculate the probability laws would 
be of no practical value. 

The meaning of the condition (c) may be illustrated on the following example.
Consider the locus of points, say $W(\bar{x}, s)$, in which both $\bar{x}$ and $s$ are
constant. This is the intersection of the hypercylinder, $s^2$ = constant, and the
prime, $\bar{x}$ = constant, and its dimensions are therefore $(n-2)$. Now if $T$ is a
function of the $x$’s such that the locus of points in which it is constant is
$W(\bar{x}, s)$, then we shall not consider it as a statistic. On the other hand, we
shall consider $W(\bar{x}, s)$ as determined by the values of two statistics, $\bar{x}$
and $s$.

Clearly for a function, $T$, to be a statistic we could not require that for any
value $T'$ the locus $W(T')$ be of $(n-1)$ dimensions, as such a condition would
eliminate a number of statistics in common use. For instance, for any number
$T' < 0$ the locus of points in which $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n = T'$
does not contain points at all. Further, the locus of points in which $s^2 = 0$,
represented by the straight line $x_1 = x_2 = \ldots = x_n$, is only of one dimension.

Again, it would not be reasonable to require that the locus, $W(T')$, be of
exactly $(n-1)$ dimensions. In a number of practical problems we do, in fact,
consider statistics which are such that the locus is of $n$ dimensions. To illustrate
this point consider the case where we have $n$ independent observations,
$x_1, x_2, \ldots, x_n$, of a single random variable which follows a normal law with
known standard deviation. For the purpose of testing hypotheses regarding the mean
associated with this law and for the purpose of its estimation, we can and do
sometimes use a statistic $T$ which is equal to the number of the $x$’s having a value
greater than some fixed value, say $X$.* The locus $W(T')$ in the sample space in
which this statistic is constant is of $n$ dimensions, i.e. of the same number as the
sample space. This point can be made clear by considering the case $n = 2$, as
suggested in Fig. 1.

Here the sample space is represented by the plane of $x_1$ and $x_2$, and $T$ assumes
constant values of (i) 0, (ii) 1, and (iii) 2, respectively, for (i) $x_1 < X$, $x_2 < X$,
(ii) $x_1 < X$, $x_2 > X$, or $x_1 > X$, $x_2 < X$, and (iii) $x_1 > X$, $x_2 > X$. All
four of these loci are of two dimensions.
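
The case $n = 2$ may be tried out directly. The short Python sketch below is an illustration only: the names `count_statistic` and `locally_constant`, the value 1.0 taken for $X$ and the four sample points are all assumptions. It evaluates $T$ at one point of each of the four loci and checks that a small displacement along either coordinate leaves $T$ unchanged, which is what one expects in the interior of a two-dimensional locus.

```python
def count_statistic(xs, big_x):
    """Number of observations exceeding the fixed value X (called big_x here)."""
    return sum(1 for x in xs if x > big_x)

def locally_constant(xs, big_x, step=0.01):
    """Check that T keeps its value when each coordinate in turn is moved slightly."""
    t = count_statistic(xs, big_x)
    for i in range(len(xs)):
        for delta in (-step, step):
            moved = list(xs)
            moved[i] += delta
            if count_statistic(moved, big_x) != t:
                return False
    return True

# One assumed point in each of the four loci of the case n = 2, with X = 1.0.
for point in [(0.2, 0.4), (0.2, 1.7), (1.3, 0.4), (1.3, 1.7)]:
    print(point, count_statistic(point, 1.0), locally_constant(point, 1.0))
```

The check moves one coordinate at a time and so only suggests, without proving, that each locus has the full number of dimensions.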

* This is usually done when, for some reason, we are unable to ascertain the actual values of the
$x$’s but can count the number of those which exceed the value $X$.

In the condition (d) a distinction is made between an unknown parameter, say $\theta$,
and its definite value, say $\theta = \theta_0$, specified by some hypothesis. The object
of this limitation is merely to make sure that it will be possible to calculate the
value of $T$ at almost all points of the sample space. If, for example, we write the
expression $z = (\bar{x} - \theta)/s$ and state that $\theta$ is the unknown population
mean of the $x$’s, $\bar{x}$ being the mean and $s^2$ the variance in a sample, then we
are unable to calculate the value of $z$, although $\bar{x}$ and $s$ are known. Such an
expression will not be called a statistic. On the other hand, if we substitute into $z$
instead of the “unknown parameter $\theta$” its numerical value, say $\theta_0 = 1$,
specified by some hypothesis, then $z = (\bar{x} - 1)/s$ becomes a function of the $x$’s
determined in every point of the sample space (except in a set of measure zero for
which $x_1 = x_2 = \ldots = x_n$), and we shall call it a statistic.
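
A minimal sketch in Python may make the point concrete. It assumes the divisor $n$ for the variance, as in the formula for $s^2$ used in this section, and returns `None` on the line $x_1 = x_2 = \ldots = x_n$ where the ratio has no sense. The function names and the sample values are hypothetical.

```python
import math

def mean(xs):
    """Sample mean."""
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance with divisor n."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def student_ratio(xs, theta0):
    """z = (mean - theta0) / s, with theta0 a numerical value fixed by a hypothesis.

    Returns None where s = 0, i.e. on the line x1 = x2 = ... = xn.
    """
    s = math.sqrt(variance(xs))
    if s == 0.0:
        return None
    return (mean(xs) - theta0) / s

print(student_ratio([1.0, 2.0, 3.0], 1.0))
print(student_ratio([2.0, 2.0, 2.0], 1.0))
```

The second argument must be a number. With $\theta$ left unknown there is nothing to pass to the function, which is the sense in which the expression fails to be a statistic.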

It may be noticed here that some writers use the words “statistic” and
“estimate” as synonymous. In our view a distinction should be made, the term
“estimate” being used only with regard to such statistics which, for one reason
or another, are selected for estimating the value of unknown parameters involved
in the probability law of the random variables considered. As we are, however,
not concerned here with problems of estimation we need not enter into any
details concerning “estimates”.

Fig. 1.

We hope that as a result of this somewhat lengthy explanation the meaning of 
our Definition I will be understood and further that it will be found to be not 
too narrow. 


Definition II. The statistic $T_1$ is called a specific sufficient statistic with regard
to the parameter $\theta_1$ if, whatever other statistic $T_2$ be taken, the relative
probability law $p(T_2 \mid T_1)$ of $T_2$, given $T_1$, is independent of $\theta_1$.
This we believe to correspond with Fisher’s original definition of a sufficient
statistic. We have added the adjective “specific” for convenience in comparison.

Definition III. The statistic $T_1$ is called a shared sufficient statistic of the
parameters $\theta_1, \ldots, \theta_q$ if, whatever other statistic $T_2$ be taken, the
relative probability law of $T_2$, given $T_1$, is independent of these $q$ parameters,
while it depends on the remaining $l - q$ parameters $\theta_{q+1}, \ldots, \theta_l$.




It will be noticed that the conception of a shared sufficient statistic is narrower
than that of a specific sufficient statistic. In fact, if $T_1$ is a shared sufficient
statistic of say two parameters $\theta_1$ and $\theta_2$, then it will satisfy the
definition of a specific sufficient statistic of $\theta_1$ or $\theta_2$ taken
separately. On the other hand, a statistic which is known to be specifically sufficient
with regard to $\theta_1$ need not be a shared sufficient statistic of $\theta_1$ and
$\theta_2$.


Definition IV. A set of $m$ algebraically independent statistics $T_1, \ldots, T_m$
(i.e. such that none of them can be presented as a function of the others) is called
a sufficient set of statistics with regard to parameters $\theta_1, \ldots, \theta_q$
if, whatever be any other statistic, $T_{m+1}$, the relative probability law of
$T_{m+1}$, given $T_1, \ldots, T_m$, is independent of $\theta_1, \ldots, \theta_q$.

It will be seen that, starting with this definition, it would be possible to carry 
on the subdivision of the conception of a sufficient set of statistics. We shall, 
however, leave this subject to be dealt with elsewhere. 

Neyman has recently proved the following proposition:*


Proposition I. The necessary and sufficient condition for a statistic to be
specifically sufficient with regard to a parameter $\theta$ (in the sense of
Definition II) is that in any point of the sample space (as defined on p. 117), except
perhaps for a set of measure zero, it should be possible to present the probability
law of the $x$’s in the form of the product

$$p(x_1, \ldots, x_n \mid \theta) = p(T \mid \theta)\, \phi(x_1, \ldots, x_n)\, \big|_{T = T(x_1, \ldots, x_n)}, \tag{9}$$

where $p(T \mid \theta)$ denotes the probability law of $T$, and $\phi$ is a function of
the $x$’s, independent of $\theta$.

This necessary and sufficient condition may also be put in the following form:


Proposition Ia. The necessary and sufficient condition for a statistic $T$ to be
specifically sufficient with regard to a parameter $\theta$ is that it should be
possible to present the probability law of the $x$’s in the form of a product

$$p(x_1, \ldots, x_n \mid \theta) = f(T, \theta)\, \phi(x_1, \ldots, x_n), \tag{10}$$

where the function $f$ depends upon $T$ and $\theta$, and $\phi$ does not involve
$\theta$, the above equality holding good in all points of the sample space, except
perhaps in a set of points of measure zero.†
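
A simple consequence of (10) can be checked numerically: at two sample points giving the same value of $T$, the ratio of the probability laws reduces to the ratio of the values of $\phi$ and is therefore the same whatever $\theta$. The sketch below assumes, purely for illustration, independent normal observations with unit variance and unknown mean $\theta$, for which the sample mean may be taken as $T$. The function names and the sample points are hypothetical.

```python
import math

def joint_law(xs, theta):
    """Assumed law: independent normal observations, mean theta, unit variance."""
    n = len(xs)
    q = sum((x - theta) ** 2 for x in xs)
    return (2.0 * math.pi) ** (-n / 2.0) * math.exp(-0.5 * q)

def ratio(xs_a, xs_b, theta):
    """Ratio of the laws at two sample points for a given value of theta."""
    return joint_law(xs_a, theta) / joint_law(xs_b, theta)

same_mean_a = [0.0, 1.0, 2.0]   # mean 1
same_mean_b = [1.0, 1.0, 1.0]   # mean 1
other_mean = [0.0, 0.0, 0.0]    # mean 0

for theta in (-1.0, 0.0, 2.5):
    # First ratio is free of theta; the second changes with theta.
    print(theta, ratio(same_mean_a, same_mean_b, theta), ratio(same_mean_a, other_mean, theta))
```

For the two points with equal means the ratio works out to $e^{-1}$ at every $\theta$, while for points with different means it varies with $\theta$.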


* Giornale dell’ Istituto Italiano degli Attuari, vol. vi, no. 4 (1935). See also G. Darmois, Comptes
Rendus, t. cc (1935), p. 1265.

† The sufficiency of the conditions expressed in Propositions I and Ia is evident, and it has been
used by Fisher (e.g. in Proc. Roy. Soc. A, vol. cxliv (1934), p. 289). The necessity of the conditions
is not, however, so evident, and this forms the main topic of Neyman’s paper quoted. The proof
there given concerns both the case of continuous and discontinuous variables. In the latter case
the equations (9) or (10) must hold good in all points where $p(x_1, \ldots, x_n \mid \theta)$ is
not zero. In the case where the variables are continuous it was also required that
$\partial T / \partial x_i$ ($i = 1, \ldots, n$) should exist and be continuous, and that at least
one of these derivatives should differ from zero in almost the whole sample space. It is believed,
however, that these limitations are not essential for the validity of the theorem.


We may now state the following further propositions:


Proposition II. For a statistic $T$ to be a shared sufficient statistic of parameters
$\theta_1, \ldots, \theta_q$, it is necessary and sufficient that in almost every point
of $W$ it should be possible to present the probability law of the $x$’s in the form of
the product

$$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_q, \ldots, \theta_l) = p(T \mid \theta_1, \ldots, \theta_q)\, \phi(x_1, \ldots, x_n;\ \theta_{q+1}, \ldots, \theta_l)\, \big|_{T = T(x_1, \ldots, x_n)}, \tag{11}$$

where $p(T \mid \theta_1, \ldots, \theta_q)$ is the probability law of $T$, and the
function $\phi$ does not depend upon $\theta_1, \ldots, \theta_q$.


Proposition III. For a set of algebraically independent statistics $T_1, \ldots, T_m$
to be a sufficient set with regard to the parameters $\theta_1, \ldots, \theta_q$, it is
necessary and sufficient that in almost every point of $W$ it should be possible to
present the probability law of the $x$’s in the form of the product

$$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_q, \ldots, \theta_l) = p(T_1, \ldots, T_m \mid \theta_1, \ldots, \theta_q)\, \phi(x_1, \ldots, x_n;\ \theta_{q+1}, \ldots, \theta_l), \tag{12}$$

where $p(T_1, \ldots, T_m \mid \theta_1, \ldots, \theta_q)$ is the probability law of
$T_1, \ldots, T_m$, and the function $\phi$ does not depend upon
$\theta_1, \ldots, \theta_q$.

Here again equivalent forms of the conditions expressed in Propositions II
and III will be obtained if instead of $p(T \mid \theta_1, \ldots, \theta_q)$ and
$p(T_1, \ldots, T_m \mid \theta_1, \ldots, \theta_q)$ in equations (11) and (12) were
substituted any functions $f_1(T, \theta_1, \ldots, \theta_q)$ and
$f_2(T_1, \ldots, T_m, \theta_1, \ldots, \theta_q)$ respectively, not necessarily
probability laws. Propositions II and III in such transformed form will be referred
to as Propositions IIa and IIIa respectively.

Since the method of proof of Propositions II and III is identical with that given
by Neyman for Proposition I, we shall omit the proof and refer the reader to the
publication quoted. It should be noticed that while in the present paper we
consider only the case where the variables $x_1, \ldots, x_n$ are continuous, the
Definitions II-IV and the Propositions I-III hold good in the most general case, except
that for discontinuous variables the equalities (11) and (12) must hold good in all
points where the left-hand side is positive.

In the course of the following pages examples will be given illustrating these
definitions of sufficient statistics. Our main purpose is, however, to consider what
connection exists between such statistics and uniformly most powerful tests. In
doing so we shall limit ourselves to the case where the hypothesis tested is simple,
and specifies the value of one or more parameters involved in the probability law
of the $x$’s, whose functional form is assumed known.* We shall start with the
assumption that a uniformly most powerful test with regard to a class $\Omega$ exists in
the general case where the alternative hypotheses of $\Omega$ depend upon several, say
$q$, independent parameters. Since, however, this assumption contradicts the
second of Fisher’s statements quoted on p. 116, we shall start by giving an example
where it is true.

* For the definition of a simple hypothesis see p. 114.


5. The existence of uniformly most powerful tests when the alternative hypotheses
depend upon more than one independent parameter.

Example I. Suppose that each of the independent random variables, $x_i$,
follows the probability law

$$p(x_i) = \beta e^{-\beta (x_i - \gamma)} \quad \text{for } x_i \geq \gamma, \tag{13}$$

$$p(x_i) = 0 \quad \text{for } x_i < \gamma. \tag{14}$$

Fig. 2. Example with exponential law; the best critical region is shaded.


We shall assume that the class $\Omega$ is composed of hypotheses for which
$\gamma \leq \gamma_0$ and $\beta \geq \beta_0 > 0$, $\gamma$ and $\beta$ being thus two
independent parameters. The simple hypothesis to test, $H_0$, assumes that
$\gamma = \gamma_0$, $\beta = \beta_0$. Any alternative, $H'$, will thus assume
either that $\gamma < \gamma_0$, $\beta = \beta_0$; $\gamma < \gamma_0$,
$\beta > \beta_0$; or $\gamma = \gamma_0$, $\beta > \beta_0$. Fig. 2 illustrates the
situation in the case of a sample of two ($n = 2$), a single alternative to $H_0$ only,
say $H_1$ with $\gamma = \gamma_1$, $\beta = \beta_1$, being represented.

The joint probability law for $n$ independent variables $x_1, \ldots, x_n$ is given by

$$p(x_1, \ldots, x_n \mid \gamma, \beta) = \beta^n e^{-\beta \sum_{i=1}^{n} (x_i - \gamma)} \tag{15}$$

in the part of the sample space, say $W_+(\gamma)$, determined by the inequalities

$$x_i \geq \gamma \quad (i = 1, 2, \ldots, n), \tag{16}$$

and will be zero in the remainder of $W$, which we shall denote by $W_0(\gamma)$. To
simplify notation we shall write below $p_0$ for
$p(x_1, \ldots, x_n \mid \gamma_0, \beta_0)$ and, if either $\gamma \neq \gamma_0$ or
$\beta \neq \beta_0$ or both, $p_1$ for $p(x_1, \ldots, x_n \mid \gamma, \beta)$.

The best critical region, $w$, for any alternative $H'$ will be defined by the
inequality

$$p_1 \geq k p_0, \tag{17}$$

where $k$ is a constant to be determined, so that $P\{E \in w \mid H_0\} = \alpha$. It
will be seen that $w$ must contain all points of $W_0(\gamma_0)$, i.e. in which
$p_0 = 0$, since all such points satisfy (17) whatever be the corresponding value of
$p_1$. Besides these points of $W_0(\gamma_0)$, the best critical region will contain
certain points of $W_+(\gamma_0)$ in which $p_0 > 0$.

Since $\gamma_0 \geq \gamma$, then $p_1 > 0$ whenever $p_0 > 0$, or in other words
whenever $p_0$ is of the form (13), $p_1$ must be of similar form. It follows that the
part of the region $W_+(\gamma_0)$ included in the best critical region is defined by

$$\beta^n e^{-n\beta (\bar{x} - \gamma)} \geq k\, \beta_0^n e^{-n\beta_0 (\bar{x} - \gamma_0)}, \tag{18}$$

$k$ being a constant to be determined later, and $\bar{x}$ denoting the mean of the $n$
values of $x$. The formula (18) is equivalent to

$$(\beta - \beta_0)\, \bar{x} \leq \gamma\beta - \gamma_0\beta_0 - \frac{1}{n} \log k + \log(\beta/\beta_0). \tag{19}$$


According to the conditions of our problem $\beta$ may be either equal to or greater
than $\beta_0$. In the first case the inequality (18) does not imply any limitation of
the shape of the part of $W_+(\gamma_0)$ to be included in the best critical region,
which may therefore be chosen arbitrarily, subject to the unique condition that

$$P\{E \in w \mid H_0\} = \alpha.$$

In the case where $\beta > \beta_0$ the inequality (19) may be written

$$\bar{x} \leq \left\{ \gamma\beta - \gamma_0\beta_0 - \frac{1}{n} \log k + \log(\beta/\beta_0) \right\} \Big/ (\beta - \beta_0) = k_1 \text{ (say)}, \tag{20}$$

and does imply a limitation. This limitation is, however, of the same character
whatever the values of $\gamma$ and $\beta$ specified by the alternative hypothesis,
$H'$, included in $\Omega$, provided $\beta > \beta_0$. Using the known distribution of
the mean in a sample from an exponential distribution* and the Tables of the Incomplete
Gamma Function,† it is easy to determine the $k_1$ of (20) such that
$P\{E \in w \mid \gamma_0, \beta_0\} = \alpha$. This quantity, $k_1$, will be a function
of $\alpha$, $\gamma_0$ and $\beta_0$ only, so that the part of $W_+(\gamma_0)$, defined
by (20), to be included in the best critical region depends on the parameters
$\gamma_0$ and $\beta_0$ specified by $H_0$, but is independent of the alternative $H'$.
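
The determination of $k_1$ may be sketched in Python without tables. Under $H_0$ the sum of the $n$ values $x_i - \gamma_0$ follows a law of the incomplete gamma type, and for integral $n$ its integral can be written as a finite sum, so that the equation of size $\alpha$ can be solved by bisection. This is a sketch under those assumptions and not the authors’ computation; the function names and the numerical values of $\alpha$, $n$, $\gamma_0$ and $\beta_0$ are hypothetical.

```python
import math

def erlang_cdf(s, n, rate):
    """P{S <= s} for S the sum of n independent exponentials with the given rate."""
    if s <= 0:
        return 0.0
    t = rate * s
    tail = sum(t ** j / math.factorial(j) for j in range(n))
    return 1.0 - math.exp(-t) * tail

def cutoff_k1(alpha, n, gamma0, beta0, tol=1e-12):
    """Solve P{mean <= k1 | gamma0, beta0} = alpha by bisection on s = n * (k1 - gamma0)."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n, beta0) < alpha:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, n, beta0) < alpha:
            lo = mid
        else:
            hi = mid
    return gamma0 + 0.5 * (lo + hi) / n

def in_critical_region(xs, k1, gamma0):
    """Region w0: some x_i below gamma0 (where p0 = 0), or sample mean not above k1."""
    return min(xs) < gamma0 or sum(xs) / len(xs) <= k1

k1 = cutoff_k1(alpha=0.05, n=5, gamma0=1.0, beta0=2.0)
print(k1, in_critical_region([1.1, 1.2, 1.0, 1.3, 1.1], k1, 1.0))
```

It may be noticed that `cutoff_k1` takes no argument describing the alternative: only $\alpha$, $n$, $\gamma_0$ and $\beta_0$ enter, in agreement with the statement above.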

* See for example Neyman and Pearson, Biometrika, vol. xxA (1928), p. 223:

$$p(\bar{x} \mid \gamma_0, \beta_0) = \frac{(n\beta_0)^n}{\Gamma(n)}\, (\bar{x} - \gamma_0)^{n-1} e^{-n\beta_0 (\bar{x} - \gamma_0)}.$$

† Karl Pearson, Tables of the Incomplete Gamma Function.


It follows that if we include in the best critical region

(1) the whole of the region $W_0(\gamma_0)$, where $p_0 = 0$,

(2) the part of $W_+(\gamma_0)$ (in which $p_0 > 0$) determined by the inequality (20),

we obtain a region, say $w_0$, which is a best critical region for $H_0$ with regard to
any alternative $H'$ of $\Omega$ specifying the two independent parameters $\gamma$ and
$\beta$, provided that $\Omega$ is restricted to the class $\gamma \leq \gamma_0$,
$\beta \geq \beta_0$.

We conclude that the existence of a uniformly most powerful test with regard
to a class, $\Omega$, of simple hypotheses does not require that the number of
independent parameters specified by the alternatives should be necessarily equal to
one. The statement to the contrary, as made by Fisher, is therefore in general not
correct.


6. Conclusions to be drawn from the existence of a uniformly most powerful test. 

Suppose that the probability law $p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_l)$,
or for short $p_1(E)$, where $x_1, \ldots, x_n$ are the coordinates of the point $E$,
depends upon $l$ independent parameters $\theta_1, \ldots, \theta_l$, whose values for
all admissible hypotheses of a set $\Omega$ are contained in certain intervals, limited
or not. The hypothesis tested, $H_0$, specifies one such system of values,
$\theta_1^0, \ldots, \theta_l^0$, and to shorten the notation we shall write $p_0(E)$
for $p(x_1, \ldots, x_n \mid \theta_1^0, \ldots, \theta_l^0)$. Denote by $W_+$ and $W_0$
the parts of the sample space $W$ in which $p_0(E) > 0$ and $p_0(E) = 0$ respectively.

Assume that there exists a uniformly most powerful test, and denote by $\alpha$ the
size of the corresponding best critical region, $w(\alpha)$. This region, besides
including all points of $W_0$, in which $p_0 = 0$, will contain a part, say
$v(\alpha)$, in which $p_0 > 0$. It is clear that in any point, $E$, of $v(\alpha)$ we
must have $p_1(E) > 0$ for every system of values of the $\theta$’s, for otherwise $E$
could not be included in the best critical region within which it is known that
$p_1 \geq k p_0$. $v(\alpha)$ will be called the positive part of the best critical
region.

We shall say that $E$ is a point belonging to the positive boundary $L(\alpha)$ of
$w(\alpha)$ if:

(a) $E$ belongs to $v(\alpha)$,

(b) in the vicinity of $E$ there is at least one point $E'$ belonging to $w(\alpha)$,
and at least one point $E''$ lying outside $w(\alpha)$, both $E'$ and $E''$ being
different from $E$.


Proposition IV. If w(«) is the best critical region for Hj common to all alter- 
natives included in 22, and if #, and E£, are any two different points of the positive 
boundary L(«), then, for every fixed system of values of 6,, ... 6,, we shall have 

Py (£)/Po (Ly) = 1 (2) /2o (Ee). (21) 

Fix a system of admissible values of the $\theta$'s, say $\theta_i = \theta_i'$ $(i = 1, \ldots l)$. The best
critical region, $w(\alpha)$, being common with regard to all admissible hypotheses,
there must exist a constant, say $k(\theta_1', \ldots \theta_l')$, such that

    $p_1(E) \ge k(\theta_1', \ldots \theta_l')\, p_0(E)$ if $E$ is within $w(\alpha)$,    (22)
    $p_1(E) \le k(\theta_1', \ldots \theta_l')\, p_0(E)$ for all points $E$ outside $w(\alpha)$.    (23)


J. NEYMAN AND E. S. Pearson 125 

If $E_i$ $(i = 1, 2)$ lies on the positive boundary $L(\alpha)$, then both $p_0(E_i)$ and $p_1(E_i)$
are positive in $E_i$ and therefore continuous. Since in the vicinity of $E_i$ there must
be a point, $E_i'$, lying within $w(\alpha)$ where (22) is true, and also a point $E_i''$ where
(23) is true, it follows that

    $p_1(E_i) = k(\theta_1', \ldots \theta_l')\, p_0(E_i)$ $(i = 1, 2)$,    (24)

and consequently (21) is true and the proposition proved.
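
Proposition IV may be checked numerically on a simple case. The sketch below is not taken from the memoir: it assumes $n$ independent normal observations of unit variance and mean $\theta$, with $H_0$: $\theta = 0$ and alternatives $\theta > 0$, for which the best critical region is $\bar{x} \ge$ constant and its positive boundary is a prime $\bar{x}$ = constant. The function names are illustrative only.

```python
import math

def joint_density(xs, theta):
    """Joint density of independent N(theta, 1) observations (assumed model)."""
    return math.prod(
        math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi) for x in xs
    )

def ratio(xs, theta, theta0=0.0):
    """p_1(E) / p_0(E) at the sample point E = xs."""
    return joint_density(xs, theta) / joint_density(xs, theta0)

# Two different points of the same boundary, the prime x-bar = 1.2.
E1 = (0.2, 2.2)
E2 = (1.5, 0.9)

for theta in (0.5, 1.0, 3.0):
    # The two ratios agree for every alternative, as (21) requires.
    print(theta, ratio(E1, theta), ratio(E2, theta))
```

The common value of the ratio changes with $\theta$, which corresponds to the constant $k(\theta_1', \ldots \theta_l')$ of (24); a point off the boundary gives a different ratio.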

Assume now that for any value of $\alpha$, $0 < \alpha \le \alpha_0 \le 1$, there exists a uniformly
most powerful test for which the size of the best critical region is $\alpha$. (In the
majority of problems $\alpha_0 = 1$, but this is not necessarily the case.) Denote by $S$ the
set of uniformly most powerful tests such that to any $\alpha$, $0 < \alpha \le \alpha_0$, there
corresponds one and only one test of $S$. This set will be called a system of tests associated
with the limit $\alpha_0$. Denote by $w(\alpha)$ the best critical region of size $\alpha$ corresponding
to a test of $S$. We shall say that the set $S$ is ordered if for any $0 < \alpha_1 < \alpha_2 \le \alpha_0$ the
best critical region $w(\alpha_1)$ is included in $w(\alpha_2)$.


Fig. 3.


Proposition V. If a system, $S_1$, of tests exists and is not ordered, then it is
possible to find another system, $S_2$, associated with the same limit $\alpha_0$, which will
be an ordered system.

Proposition V is a simple corollary of the following lemma.

Lemma I. If $\alpha_1 < \alpha_2$ and $w(\alpha_1)$ and $w(\alpha_2)$ are two best critical regions common
to all alternatives, of size $\alpha_1$ and $\alpha_2$ respectively, and further if $w(\alpha_1)$ contains a
part, say $v'$, which is not contained in $w(\alpha_2)$, then it is possible to find a region,
$w'(\alpha_1)$, which will be a best critical region common to all alternatives, will be of
size $\alpha_1$, and will be contained in $w(\alpha_2)$.

$w(\alpha_1)$ and $w(\alpha_2)$ will both include all points of $W_0$, in which $p_0 = 0$. If $w(\alpha_1)$ is
not wholly contained in $w(\alpha_2)$, the positive parts, $v(\alpha)$, of these two regions may
be expressed as

    $v(\alpha_1) = v_0 + v'$,  $v(\alpha_2) = v_0 + v''$,    (25)

where $v_0$, $v'$ and $v''$ are regions without common points as shown in Fig. 3.
Whatever the admissible system of values $\theta_1, \ldots \theta_l$, we must have

    $p_1(E) \ge k_1 p_0(E)$    (26)

126 Sufficient Statistics and Uniformly Most Powerful Tests 
in every point, $E$, of $w(\alpha_1)$ and therefore of $v'$, and

    $p_1(E) \le k_1 p_0(E)$    (27)

everywhere outside $w(\alpha_1)$ and therefore within $v''$. On the other hand, within
$w(\alpha_2)$, and therefore within $v''$,

    $p_1(E) \ge k_2 p_0(E)$,    (28)

so that, in view of (27), within $v''$

    $k_2 p_0(E) \le p_1(E) \le k_1 p_0(E)$,    (29)

and thus $k_2 \le k_1$. Further, outside $w(\alpha_2)$ and therefore within $v'$,

    $p_1(E) \le k_2 p_0(E)$,    (30)

which together with (26) leads to

    $k_1 p_0(E) \le p_1(E) \le k_2 p_0(E)$    (31)

within $v'$. We conclude that $k_1 \le k_2$, and as it has already been shown that $k_2 \le k_1$,
it follows that $k_1 = k_2 = k$, say. Consequently in view of (29) and (31) we must have

    $p_1(E) = k p_0(E)$    (32)

within both the regions $v'$ and $v''$ for every system of values of $\theta_1, \ldots \theta_l$.
We shall now show that in place of $v(\alpha_1)$ it is possible to find another region, say
$u(\alpha_1)$, such that $w'(\alpha_1) = W_0 + u(\alpha_1)$ has the following properties: (1) it is a best
critical region with regard to all admissible alternatives, (2) it is of size $\alpha_1$, (3) it is
included in $w(\alpha_2) = W_0 + v(\alpha_2)$.

Denote by $v'''$ any part of $v''$, such that if $u(\alpha_1) = v_0 + v'''$, then

    $\int \cdots \int_{u(\alpha_1)} p_0(E)\, dx_1 \ldots dx_n = \alpha_1$.    (33)

Clearly equation (33) is equivalent to the following:

    $\int \cdots \int_{v'''} p_0(E)\, dx_1 \ldots dx_n = \int \cdots \int_{v'} p_0(E)\, dx_1 \ldots dx_n$.    (34)

Since the integral of $p_0(E)$ over the whole region $v(\alpha_2) = v_0 + v''$ is equal to
$\alpha_2 > \alpha_1$, it is always possible to find such a region $v'''$. Thus the region

    $w'(\alpha_1) = W_0 + u(\alpha_1) = W_0 + v_0 + v'''$

will have the properties (2) and (3); we shall show that it must also have the
property (1). For this purpose it is sufficient to show that

    $\int \cdots \int_{v'''} p_1(E)\, dx_1 \ldots dx_n = \int \cdots \int_{v'} p_1(E)\, dx_1 \ldots dx_n$.    (35)

This equality, however, follows directly from (34) and (32), the latter being
true within $v'$ and $v''$ and thus within $v'''$.
Lemma I is therefore established and consequently Proposition V is proved.




We shall now define what will be termed the boundary space, $B$. The definition
will apply only to cases where a system, $S$, of uniformly most powerful tests
exists; under these conditions we may assume that the system $S$ is ordered.

The boundary space $B$ corresponding to the system $S$ will mean the set of
points, $E$, having the property that each belongs to the positive boundary of at
least one best critical region associated with the tests included in $S$.


Proposition VI. If a system, $S$, of uniformly most powerful tests exists, then
within the entire boundary space, $B$, we shall have identically

    $p_1(E) = k(T, \theta_1, \ldots \theta_l)\, p_0(E)$,    (36)

where $T$ is a single valued function of the $x$'s defined in $B$ and $k(T, \theta_1, \ldots \theta_l)$ a
function of $T$ and $\theta_1, \ldots \theta_l$, not explicitly depending on the $x$'s.

To prove this proposition we shall define a function, $T$, and then show that
whatever be the admissible values of $\theta_1, \ldots \theta_l$ and whatever two points $E_1$ and $E_2$
be taken in which $T$ has the same value, then

    $p_1(E_1)/p_0(E_1) = p_1(E_2)/p_0(E_2)$.    (37)

Let $E'$ be a point of $B$. It may belong to only one positive boundary, say
$L(\alpha')$, or it may belong to several. In the first case we shall ascribe to the function
$T$ the value $T(E') = \alpha'$. In the second case denote by $\alpha''$ and $\alpha'''$ the lower and
upper bounds of values of $\alpha$ for which $L(\alpha)$ contains $E'$. As the system $S$ is ordered,
it follows that whatever be the value of $\alpha$ included between $\alpha''$ and $\alpha'''$, the positive
boundary $L(\alpha)$ must contain $E'$. We shall ascribe to the function $T$ the value

    $T(E') = \tfrac{1}{2}(\alpha'' + \alpha''')$.

It is seen that in each point of $B$ the function $T$ is defined and has a unique
value. Further, if $T$ has the same value in any two points $E_1$ and $E_2$ of $B$, this
implies that these points belong to the same positive boundary, say $L(\alpha)$.* It
follows from Proposition IV that equation (37) must hold good, and the proof is
thus completed.
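
The construction of $T$ may be sketched for a simple ordered system. The example is an assumption made for illustration and does not appear in the memoir: $n$ independent normal observations of unit variance, $H_0$: $\theta = 0$, alternatives $\theta > 0$, and $w(\alpha)$ taken as the region $\bar{x} \ge z_{1-\alpha}/\sqrt{n}$, where $z_{1-\alpha}$ is the standard normal quantile. Each point then lies on exactly one positive boundary, and $T(E)$ is the size $\alpha$ of that region.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal law

def boundary_level(xs):
    """T(E): the size alpha whose boundary x-bar = z_(1-alpha)/sqrt(n) passes through E."""
    n = len(xs)
    xbar = sum(xs) / n
    return 1.0 - STD_NORMAL.cdf(math.sqrt(n) * xbar)

def in_region(xs, alpha):
    """True if E lies in the assumed region w(alpha) = {x-bar >= z_(1-alpha)/sqrt(n)}."""
    n = len(xs)
    return sum(xs) / n >= STD_NORMAL.inv_cdf(1.0 - alpha) / math.sqrt(n)

def ratio(xs, theta):
    """p_1(E)/p_0(E) for N(theta, 1) against N(0, 1)."""
    n = len(xs)
    return math.exp(theta * sum(xs) - 0.5 * n * theta ** 2)

E1 = (0.2, 2.2)
E2 = (1.5, 0.9)
print(boundary_level(E1), boundary_level(E2))   # same value of T
print(ratio(E1, 0.7), ratio(E2, 0.7))           # hence the same ratio, as in (37)
```

In this assumed case the system is ordered, since a smaller $\alpha$ gives a larger bound on $\bar{x}$, and the ratio is a function of $T$ and $\theta$ alone, in agreement with (36).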


Proposition VII. If a system of uniformly most powerful tests exists and if the
corresponding boundary space, $B$, differs from the sample space $W$ by no more
than a set of points of measure zero, then there must exist a statistic, $T$, which
will be a shared sufficient statistic of the parameters $\theta_1, \ldots \theta_l$. In particular, if the
number of parameters specified by the admissible hypotheses of the set $\Omega$ is $l = 1$,
then under the conditions stated there must exist a specific sufficient statistic
of the only unknown parameter $\theta_1$.

It will be seen that this proposition follows as a simple corollary from the 
previous propositions. In fact, if B differs from W no more than by a set of points 


* It will be noted that two points may belong to the same positive boundary and yet the value
of $T$ may not be the same at both. This may occur if, say, $E_1$ is a point common to several positive
boundaries.


of measure zero, then it follows that the equation (36) holds good almost
everywhere in $W$, which means according to Proposition II that $T$ is a shared sufficient
statistic of $\theta_1, \ldots \theta_l$.*

It has therefore been established that a shared or a specific sufficient statistic
will exist if (1) a system of uniformly most powerful tests exists, and (2) the
boundary space associated with this system is, what might be termed for short,
almost identical with the sample space. It is of interest to note that a limitation
as in (2) is necessary, since a system of uniformly most powerful tests may exist
without the existence of sufficient statistics. To illustrate this point we shall refer
to two examples, in the first of which the hypotheses belonging to $\Omega$ specify the
values of two independent parameters, in the second the value of only one. In
both cases a system of uniformly most powerful tests exists with a limit $\alpha_0 = 1$,
while there is no sufficient statistic.

7. Further illustrative examples. 

Example Ia. Consider the situation described in Example I above. The
hypotheses included in $\Omega$ specified two independent parameters $\gamma$ and $\beta$, and
writing $n\bar{x} = x_1 + \ldots + x_n$ we had

    $p(x_1, \ldots x_n \mid \gamma, \beta) = \beta^n e^{-n\beta(\bar{x} - \gamma)}$ for $x_i \ge \gamma$ $(i = 1, \ldots n)$,    (38)

and $p(x_1, \ldots x_n \mid \gamma, \beta) = 0$ if for at least one value of $i = 1, \ldots n$, $x_i < \gamma$.    (39)

The set $\Omega$ was determined by the inequalities $\gamma \le \gamma_0$ and $\beta \ge \beta_0$, $\gamma_0$ and $\beta_0 > 0$
being certain fixed numbers, and the hypothesis tested, $H_0$, was that $\gamma = \gamma_0$,
$\beta = \beta_0$. It was shown that a uniformly most powerful test for $H_0$ existed
corresponding to any $\alpha \le 1$.

It will be found, however, that the condition of Proposition VII regarding the
spaces $B$ and $W$ is not satisfied. In the first place the sample space $W$ is not
limited and extends from $-\infty$ to $+\infty$ for all variables $x_1, \ldots x_n$. For whatever
be the point $E'$ with real coordinates $x_1', \ldots x_n'$, it is possible to find a hypothesis,
say $H'$, included in $\Omega$ specifying a value $\gamma = \gamma'$, such that

    $\gamma' \le x_i'$ $(i = 1, \ldots n)$,    (40)

and consequently making the probability law $p(x_1', \ldots x_n' \mid \gamma', \beta) > 0$ in $E'$.

On the other hand, the boundary space $B$ must by definition be entirely
contained in the part, $W_{+0}$, of the sample space in which $p_0(E) > 0$. No point, $E''$,
having at least one of its coordinates $x_1, \ldots x_n$ less than $\gamma_0$, belongs therefore to
$B$, and it is clear that the set of such points $E''$ does not possess a measure equal
to zero.

We shall now show in fuller detail that, as the general theoretical conclusions
have led us to expect, no shared sufficient statistic of $\gamma$ and $\beta$ exists. For suppose
that there is such a statistic $T$, then according to Proposition II we must have
in almost the whole of $W$,

    $p(x_1, \ldots x_n \mid \gamma\beta) = p(T \mid \gamma\beta)\, \phi(x_1, \ldots x_n)$,    (41)


* It is easily seen that $T$ satisfies Definition I.


where $\phi$ is independent of both $\gamma$ and $\beta$. In particular if $\gamma_1 < \gamma_0$ and $\beta_1 > \beta_0$ are
two fixed numbers,

    $p_0(E) = p(x_1, \ldots x_n \mid \gamma_0\beta_0) = p(T \mid \gamma_0\beta_0)\, \phi(x_1, \ldots x_n)$,
    $p_1(E) = p(x_1, \ldots x_n \mid \gamma_1\beta_1) = p(T \mid \gamma_1\beta_1)\, \phi(x_1, \ldots x_n)$,    (42)

and consequently if at any two points, say $E^{(1)}$ and $E^{(2)}$ of $W_{+0}$, the shared sufficient
statistic $T$ has the same value $T_1$, then the ratio

    $R(T_1) = p_0(E^{(i)})/p_1(E^{(i)}) = p(T_1 \mid \gamma_0\beta_0)/p(T_1 \mid \gamma_1\beta_1)$    (43)

must have the same value in those points, i.e. for both $i = 1$ and 2. In other words
the locus of points, $W(T)$, where $T$ is constant must form either a part or the
whole of the locus $W(R)$ in which the ratio $R$ is constant. This circumstance will
enable us to decide whether there is a shared sufficient statistic of $\gamma$ and $\beta$. In the
present example the ratio $R$ is of the form

    $R = (\beta_0/\beta_1)^n\, e^{-n(\beta_0 - \beta_1)\bar{x} + n(\gamma_0\beta_0 - \gamma_1\beta_1)}$    (44)

in every point of $W_{+0}$. It is seen that within $W_{+0}$ the locus of points, $W(R)$, in which
$R$ is constant is identical with that in which the mean, $\bar{x}$, is constant. It follows
that if a shared sufficient statistic, $T$, exists at all, either (i) it is constant
throughout the whole of the prime $\bar{x}$ = constant and changes its value when that of $\bar{x}$
changes; or (ii) each of the primes $\bar{x}$ = constant can be broken into a number of
parts within which $T$ has a constant value, which however differs from part to
part of the prime and also from the values assumed by $T$ in other primes
$\bar{x}$ = constant.

If the position (i) were true, then $\bar{x}$ would itself be a shared sufficient statistic.
The fixing of $\bar{x}$ would in fact mean the fixing of $T$, the probability law of $\bar{x}$ would
differ at most from that of $T$ by a factor, depending on the nature of $T$ only and
not on the values of $\gamma$ and $\beta$, and finally it would follow that if $T$ satisfies the
conditions of Proposition II, then $\bar{x}$ would do so also.

A superficial examination of equation (38) might suggest to the reader that
$T = \bar{x}$ satisfies the conditions of Proposition IIa with $\phi(x_1, \ldots x_n) = 1$. This,
however, is not the case, and it is easy to show that if we write

    $p(x_1, \ldots x_n \mid \gamma\beta) = \beta^n e^{-n\beta(\bar{x} - \gamma)}\, \phi(x_1, \ldots x_n)$,    (45)

and require that this equality should hold good in almost the whole of the sample
space $W$ (so as to satisfy the conditions of Proposition IIa), we shall find that the
function $\phi$ must depend upon $\gamma$.

To see that this is the case, fix a sample point $E_1$ and choose the value of $\gamma = \gamma_1$
so that it does not exceed any of the coordinates of $E_1$. Then the probability law
of the $x$'s will be of the form (38) and

    $\phi(x_1, \ldots x_n) = p(x_1, \ldots x_n \mid \gamma_1\beta)\, \beta^{-n} e^{n\beta(\bar{x} - \gamma_1)} = 1$.    (46)

Next increase the value of $\gamma$ to $\gamma_2 > \gamma_1$, so as to exceed the smallest of the
coordinates of the point $E_1$, say $x_{sm}$. According to the definition $p(x_1, \ldots x_n \mid \gamma_2\beta) = 0$
and therefore the value of $\phi$ will have also to be zero. It follows that the value of
$\phi$ at the same sample point $E_1$ may be either 1 or 0 according to the value of $\gamma$.
As this result will apply to every point in the sample space, it follows that the
function $\phi$ does not satisfy the conditions of Proposition IIa.
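
The dependence of $\phi$ on $\gamma$ can be seen by direct evaluation. The following is a minimal sketch, assuming the law (38), (39) as read above; the sample point and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def density(xs, gamma, beta):
    """Law (38)-(39): beta^n exp(-n beta (xbar - gamma)) if every x_i >= gamma, else 0."""
    if min(xs) < gamma:
        return 0.0
    n = len(xs)
    xbar = sum(xs) / n
    return beta ** n * math.exp(-n * beta * (xbar - gamma))

def phi(xs, gamma, beta):
    """The factor phi demanded by (45), solved for at the point xs."""
    n = len(xs)
    xbar = sum(xs) / n
    return density(xs, gamma, beta) / (beta ** n * math.exp(-n * beta * (xbar - gamma)))

E1 = (1.0, 2.5, 4.0)        # hypothetical sample point, smallest coordinate 1.0
print(phi(E1, 0.5, 2.0))    # gamma_1 below the smallest coordinate
print(phi(E1, 1.5, 2.0))    # gamma_2 above the smallest coordinate
```

At the same point the factor takes one value when $\gamma$ lies below the smallest coordinate and another when it lies above, so it cannot be independent of $\gamma$.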

The result would be similar if we were to try to present $p(x_1, \ldots x_n \mid \gamma\beta)$ as a
product of $p(\bar{x} \mid \gamma\beta)$ and another function, say $\phi_1$, so as to satisfy the conditions
of Proposition II. It follows that $\bar{x}$ is not a shared sufficient statistic of $\gamma$ and $\beta$.

Let us now consider the alternative (ii). 

Denote by $T$ the shared sufficient statistic, the existence of which we assume,
and by $p(T \mid \gamma\beta)$ its probability law. Next select two points $E^{(1)}$ and $E^{(2)}$ with
coordinates $(x_1', \ldots x_n')$ and $(x_1'', \ldots x_n'')$, respectively, in which $T$ has the same
value, $T_1$. Let $x_{sm}'$ and $x_{sm}''$ be the smallest of the coordinates of the two points.
According to Proposition II, it follows that

    $p(x_1', \ldots x_n' \mid \gamma\beta) = p(T_1 \mid \gamma\beta)\, \phi(x_1', \ldots x_n')$,
    $p(x_1'', \ldots x_n'' \mid \gamma\beta) = p(T_1 \mid \gamma\beta)\, \phi(x_1'', \ldots x_n'')$,    (47)

where $\phi$ does not depend on $\gamma$ and $\beta$. It will be noted that the values of $\phi$ at the
points $E^{(1)}$ and $E^{(2)}$ cannot be equal to zero, since if this were so the values of
$p(x_1, \ldots x_n \mid \gamma\beta)$ at these points would be always zero, whereas for $\gamma \le x_{sm}', x_{sm}''$
we know that it is positive.

It may now be shown that $x_{sm}' = x_{sm}''$. For if this were not the case and, for
example, $x_{sm}' < x_{sm}''$, then for a value of $\gamma = \gamma_1$ such that $x_{sm}' < \gamma_1 < x_{sm}''$, we
should have $p(x_1', \ldots x_n' \mid \gamma_1\beta) = 0$ and at the same time $p(x_1'', \ldots x_n'' \mid \gamma_1\beta) > 0$.
Substituting these two values of $p$ into the two parts of (47) and remembering
that the values of $\phi$ in $E^{(1)}$ and $E^{(2)}$ are necessarily positive and independent of $\gamma$,
we obtain the contradictory results: $p(T_1 \mid \gamma_1\beta) = 0$ and $p(T_1 \mid \gamma_1\beta) > 0$.

It follows that if $T$ has the same value in any two points $E^{(1)}$ and $E^{(2)}$, then the
smallest coordinates of those points, $x_{sm}'$ and $x_{sm}''$, must be equal.

If in general $x_{sm}$ denotes the smallest of the coordinates of a sample point $E$, then
the locus $W(x_{sm})$, in which $x_{sm}$ = constant, will be built up of $n$ parts $(i = 1, 2, \ldots n)$,
such as

    $x_i = x_{sm}$,  $x_j \ge x_{sm}$ for $j = 1, 2, \ldots i-1, i+1, \ldots n$.    (48)

Each part is a portion of a prime of the $n$-dimensional space, i.e. is of $(n-1)$
dimensions; consequently $W(x_{sm})$, which consists of $n$ regions each of $(n-1)$
dimensions, is itself of $(n-1)$ dimensions. We have, however, already found that
the mean, $\bar{x}$, must have equal values in any two points in which $T$ has the same
value. It follows that the locus of points $W(T)$ in which $T$ is constant is at most
of $(n-2)$ dimensions lying on the intersection of the prime $\bar{x}$ = constant with the
locus $W(x_{sm})$, $x_{sm}$ denoting the smallest of the coordinates of the sample point.
This, however, means that the function $T$ is not a statistic in the sense of our
Definition I, and it follows that there is no shared sufficient statistic with regard
to $\gamma$ and $\beta$ in the example considered.


It can be seen, on the other hand, that $\bar{x}$ and $x_{sm}$ as defined above form a
sufficient set of statistics with regard to $\gamma$ and $\beta$; further, that if $\gamma$ has a known value,
then $\bar{x}$ is a specific sufficient statistic of $\beta$.
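
That the pair $\bar{x}$, $x_{sm}$ carries everything the law (38), (39) depends on may be illustrated as follows. The sketch assumes that law as read above; the three sample points are hypothetical, the first two sharing both the mean and the smallest coordinate and the third sharing the mean only.

```python
import math

def density(xs, gamma, beta):
    """Law (38)-(39): beta^n exp(-n beta (xbar - gamma)) if every x_i >= gamma, else 0."""
    if min(xs) < gamma:
        return 0.0
    n = len(xs)
    xbar = sum(xs) / n
    return beta ** n * math.exp(-n * beta * (xbar - gamma))

A = (1.0, 2.0, 6.0)   # mean 3, smallest coordinate 1
B = (1.0, 4.0, 4.0)   # mean 3, smallest coordinate 1
C = (2.0, 3.0, 4.0)   # mean 3, smallest coordinate 2

for gamma in (0.0, 0.5, 1.0, 1.5):
    for beta in (0.5, 1.0, 2.0):
        # A and B always agree; C differs from them once gamma exceeds 1.
        print(gamma, beta, density(A, gamma, beta), density(B, gamma, beta),
              density(C, gamma, beta))
```

Points agreeing in $\bar{x}$ alone are thus distinguished by some admissible $\gamma$, which is the reason the mean by itself fails to be sufficient here.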


Example II. The same points may be illustrated in the case where the
admissible hypotheses depend upon a single parameter, and it will be found that the
situation is very much as before.

As an illustration we may suppose that in the preceding example $\beta$ is any
positive and decreasing function of $\gamma$. For instance we may take $\gamma_0 = 0$ and
$\beta = 1 + \gamma^2$. Then $\beta_0 = 1$, and for $\gamma \le \gamma_0$ we shall have $\beta \ge \beta_0$. The probability law
determined by any admissible value of $\gamma \le \gamma_0 = 0$ will be

    $p(x_1, \ldots x_n \mid \gamma) = (1 + \gamma^2)^n e^{-n(1 + \gamma^2)(\bar{x} - \gamma)}$ for $x_i \ge \gamma$ $(i = 1, \ldots n)$, and
    $p(x_1, \ldots x_n \mid \gamma) = 0$ elsewhere.    (49)

The best critical region corresponding to any alternative hypothesis will be
determined by the inequality $p_1 \ge k p_0$, which reduces to

    $(1 + \gamma^2)^n e^{-n(1 + \gamma^2)(\bar{x} - \gamma) + n\bar{x}} \ge k$    (50)

or

    $\bar{x} \le$ constant.    (51)

As in Example I it is seen that a system of uniformly most powerful tests
exists. Similarly it follows that if a specific sufficient statistic $T$ for $\gamma$ exists at all,
the locus of points $W(T)$ in which it is constant must lie on the intersection of a
prime $\bar{x}$ = constant with the locus $x_{sm}$ = constant, which shows that $T$ is not a
statistic. On the other hand, $\bar{x}$ and $x_{sm}$ will form, as formerly, a sufficient set of
statistics with regard to the unique unknown parameter $\gamma$.
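
The passage from (50) to (51) rests on the left-hand side of (50) being a decreasing function of $\bar{x}$ for every admissible alternative. A short numerical check, reading (50) as $(1+\gamma^2)^n e^{-n(1+\gamma^2)(\bar{x}-\gamma)+n\bar{x}}$ and using hypothetical values of $n$, $\gamma$ and $\bar{x}$, might run:

```python
import math

def ratio(xbar, gamma, n):
    """Left-hand side of (50): p_1/p_0 with gamma_0 = 0 and beta = 1 + gamma^2.

    Valid only where p_0 > 0, i.e. where every coordinate is non-negative.
    """
    beta = 1.0 + gamma ** 2
    return beta ** n * math.exp(-n * beta * (xbar - gamma) + n * xbar)

n = 5                                   # hypothetical sample size
for gamma in (-0.5, -1.0, -2.0):        # admissible alternatives, gamma < 0
    values = [ratio(xbar, gamma, n) for xbar in (0.5, 1.0, 2.0)]
    print(gamma, values)                # decreasing in xbar for each gamma
```

Since the exponent reduces to $-n\gamma^2\bar{x} + n(1+\gamma^2)\gamma$, the ratio falls as $\bar{x}$ rises whatever the value of $\gamma \ne 0$, so that the same family of regions $\bar{x} \le$ constant serves all alternatives.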

Having thus shown that the existence of either a shared or specific sufficient 
statistic is not a necessary consequence of the existence of a system of uniformly 
most powerful tests, we shall now show that the reverse situation is also not 
necessarily true. From the following two examples it will in fact be seen that 
either specific or shared sufficient statistics may exist and yet there may be no 
best critical region common to all the alternative hypotheses. 


Example III. We shall take for the probability law of $x_1, \ldots x_n$, determined
by any admissible hypothesis and for any system of real values of the $x$'s, the
expression

    $p(x_1, \ldots x_n \mid \gamma\sigma) = \frac{1}{\sigma(\sqrt{2\pi})^n} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i - n\gamma\right)^2 - \frac{1}{2}\sum_{i=2}^{n} x_i^2\right\}$,    (52)

where $\gamma$ and $\sigma > 0$ are two independent parameters. Since the exponent of $e$ can
be expressed in the form

    $-\frac{1}{2\sigma^2}\left\{(x_1 - n\gamma)^2 + (1 + \sigma^2)\sum_{i=2}^{n} x_i^2 + 2\sum_{i=2}^{n}(x_1 - n\gamma)x_i + 2\sum_{1<i<j} x_i x_j\right\}$,    (53)

it follows that (52) is a special form of the multivariate normal law, $x_1 - n\gamma$,
$x_2, x_3, \ldots x_n$ being normally distributed about 0 and the standard deviations and
coefficients of correlation being functions of the parameter $\sigma$ only.

Suppose that it is wished to test the simple hypothesis, $H_0$, defined by $\gamma = \gamma_0$,
$\sigma = \sigma_0 > 0$. The best critical region for $H_0$ with regard to any alternative will be
defined by the inequality $p_1 \ge k p_0$, which on writing $n\bar{x} = \sum_{i=1}^{n} x_i$ reduces to

    $\bar{x}^2(\sigma^2 - \sigma_0^2) - 2\bar{x}(\gamma_0\sigma^2 - \gamma\sigma_0^2) \ge k_1$,    (54)

where $k_1$ is a function of $k$, $n$, $\gamma_0$, $\sigma_0$, $\gamma$ and $\sigma$, and $k$ is a constant to be determined
so as to obtain a best critical region of desired size $\alpha$. It follows from (54) that the
best critical region, say $w$, is limited by the surfaces

    $\bar{x} = c - d$,  $\bar{x} = c + d$,    (55)

where

    $c = (\gamma_0\sigma^2 - \gamma\sigma_0^2)/(\sigma^2 - \sigma_0^2)$    (56)

and $d$ is chosen so as to satisfy the condition regarding the size $\alpha$ of the region.
It is necessary to distinguish three cases:

(I) $\sigma^2 > \sigma_0^2$; in this case $w$ is determined by the inequalities

    $\bar{x} \le c - d_1$,  $\bar{x} \ge c + d_1$ (say).    (57)

(II) $\sigma^2 < \sigma_0^2$; in this case $w$ is determined by

    $c - d_2 \le \bar{x} \le c + d_2$.    (58)

(III) $\sigma = \sigma_0$; here $w$ is determined by

    $\bar{x}(\gamma - \gamma_0) \ge k_2$.    (59)
It is clear that if the set $\Omega$ contains all hypotheses specifying any values
whatsoever of $\gamma$ and $\sigma > 0$, then no best critical region for $H_0$ common to all the
alternatives of the set exists.
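
The three cases can be tabulated for particular alternatives. The sketch below reads (56) as $c = (\gamma_0\sigma^2 - \gamma\sigma_0^2)/(\sigma^2 - \sigma_0^2)$; the hypothesis tested and the alternatives listed are hypothetical values chosen for illustration, and the function names are not from the memoir.

```python
def centre(gamma, sigma, gamma0, sigma0):
    """c of (56); not defined when sigma equals sigma0 (case III)."""
    return (gamma0 * sigma ** 2 - gamma * sigma0 ** 2) / (sigma ** 2 - sigma0 ** 2)

def region_type(sigma, sigma0):
    """Which of the inequalities (57), (58), (59) bounds the best critical region."""
    if sigma > sigma0:
        return "two tails (57)"
    if sigma < sigma0:
        return "central band (58)"
    return "one tail (59)"

gamma0, sigma0 = 0.0, 1.0                      # hypothetical H_0
for gamma, sigma in ((1.0, 2.0), (1.0, 0.5), (-1.0, 2.0)):
    print(gamma, sigma, region_type(sigma, sigma0), centre(gamma, sigma, gamma0, sigma0))
print(region_type(1.0, sigma0))                # sigma = sigma_0
```

Both the form of the region and the position of its centre $c$ change from one alternative to another, which is why no region can be best against all of them at once.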
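Finally, it is easily seen that $\bar{x}$ is a shared sufficient statistic of the parameters
$\gamma$ and $\sigma$. Starting from (52) it is found that the probability law of $\bar{x}$ is

    $p(\bar{x} \mid \gamma\sigma) = \frac{n}{\sigma\sqrt{2\pi}}\, e^{-\frac{n^2}{2\sigma^2}(\bar{x} - \gamma)^2}$,    (60)

and that in any point of the sample space we have

    $p(x_1, \ldots x_n \mid \gamma\sigma) = p(\bar{x} \mid \gamma\sigma)\, \phi(x_1, \ldots x_n)$,    (61)

where

    $\phi(x_1, \ldots x_n) = \frac{1}{n(\sqrt{2\pi})^{n-1}}\, e^{-\frac{1}{2}\sum_{i=2}^{n} x_i^2}$.    (62)

Since $\phi$ is independent of either $\gamma$ or $\sigma$, it follows that $\bar{x}$ is a shared sufficient
statistic of $\gamma$ and $\sigma$.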
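
The factorization (61) admits a direct numerical check. The sketch assumes the readings of (52), (60) and (62) in which $n\bar{x}$ is normal about $n\gamma$ with standard deviation $\sigma$ and $x_2, \ldots x_n$ are independent unit normal variables; the sample point and parameter values are hypothetical.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def joint(xs, gamma, sigma):
    """Probability law (52) of x_1, ... x_n."""
    n = len(xs)
    total = sum(xs)
    tail = sum(x * x for x in xs[1:])          # sum of x_i^2 for i = 2, ... n
    return math.exp(-(total - n * gamma) ** 2 / (2.0 * sigma ** 2) - 0.5 * tail) / (
        sigma * SQRT_2PI ** n
    )

def law_of_mean(xbar, n, gamma, sigma):
    """Probability law (60) of the mean."""
    return n / (sigma * SQRT_2PI) * math.exp(-(n ** 2) * (xbar - gamma) ** 2 / (2.0 * sigma ** 2))

def phi(xs):
    """The factor (62), free of gamma and sigma."""
    n = len(xs)
    tail = sum(x * x for x in xs[1:])
    return math.exp(-0.5 * tail) / (n * SQRT_2PI ** (n - 1))

xs = (0.3, -1.2, 0.8, 2.0)                      # hypothetical sample point
for gamma, sigma in ((0.0, 1.0), (0.5, 2.0), (-1.0, 0.7)):
    xbar = sum(xs) / len(xs)
    print(joint(xs, gamma, sigma), law_of_mean(xbar, len(xs), gamma, sigma) * phi(xs))
```

The two printed values agree for each pair of parameters, the whole dependence on $\gamma$ and $\sigma$ being carried by the law of $\bar{x}$.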


Example IV. Here we shall show that a specific sufficient statistic of one
unknown parameter may exist, while the best critical regions corresponding to
any two alternative hypotheses are different. Take the probability law (52)
defined in the preceding example and put $\sigma = \gamma$, and suppose that $\Omega$ is the class
of hypotheses having $\gamma > 0$. Then in any point of the sample space we shall have

    $p(x_1, \ldots x_n \mid \gamma) = \frac{1}{\gamma(\sqrt{2\pi})^n} \exp\left\{-\frac{1}{2\gamma^2}\left(\sum_{i=1}^{n} x_i - n\gamma\right)^2 - \frac{1}{2}\sum_{i=2}^{n} x_i^2\right\}$.    (63)

If we now proceed to test the hypothesis that $\gamma = \gamma_0$, it is found that the best
critical region with regard to any alternative is defined by the inequality

    $\bar{x}^2(\gamma^2 - \gamma_0^2) - 2\bar{x}(\gamma - \gamma_0)\gamma\gamma_0 \ge k$.    (64)

Two cases arise:

(1) $\gamma > \gamma_0$; $w$ is then determined by the inequalities (57),
(2) $\gamma < \gamma_0$; $w$ is now determined by (58),

where now

    $c = \gamma\gamma_0/(\gamma + \gamma_0)$    (65)

and the values of $d$ are chosen so that the best critical region may be of size $\alpha$.
Clearly as before the boundaries of the best critical regions depend on the value
of $\gamma$, and there is no uniformly most powerful test with regard to all the
alternatives. Moreover, the best critical regions corresponding to any two different
alternative hypotheses are different.

$\bar{x}$ is however a specific sufficient statistic of $\gamma$, as may be shown following the
same reasoning as in Example III.

To illustrate this problem graphically the case of $n = 2$ has been taken, for
which equation (63) may be written

    $p(x_1, x_2 \mid \gamma) = \frac{1}{2\pi\gamma} \exp\left[-\frac{1}{2\gamma^2}\left\{(x_1 - 2\gamma)^2 + 2(x_1 - 2\gamma)x_2 + (1 + \gamma^2)x_2^2\right\}\right]$.    (66)

This law may be described as a normal bivariate correlation distribution for
which

    Mean $x_1 = 2\gamma$,  Mean $x_2 = 0$,
    Standard deviation of $x_1 = \sqrt{1 + \gamma^2}$,  Standard deviation of $x_2 = 1$,
    Coefficient of correlation $= -1/\sqrt{1 + \gamma^2}$.

In Fig. 4 coordinate axes $Ox_1$ and $Ox_2$ have been drawn, and the elliptic
contours of equal probability

    $\frac{1}{\gamma^2}\left\{(x_1 - 2\gamma)^2 + 2(x_1 - 2\gamma)x_2 + (1 + \gamma^2)x_2^2\right\} = \chi^2 = 5.991$    (67)

are shown for three hypotheses, namely $\gamma = 0.5$, $\gamma = 1.0$ and $\gamma = 2.0$. If a sample
point $E$, $(x_1, x_2)$, is subject to the law (66), then the probability of $E$ falling within
the ellipse (67) is 0.95.* Thus the three curves suggest pictorially the manner in
which the normal correlation distribution changes with $\gamma$. The straight line
$DOD'$ represents the limiting form of (67) as $\gamma \to 0$.


* If $w$ is the region within (67), then it is known that

    $P\{E \in w \mid \gamma\} = \iint_w p(x_1, x_2 \mid \gamma)\, dx_1\, dx_2 = 1 - e^{-\frac{1}{2}\chi^2} = 0.95$.


Fig. 4.


It will be seen that all ellipses satisfying (67) must pass through the two
points $x_2 = -x_1 = \pm\sqrt{\chi^2 - 4} = \pm 1.41$.
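
Both numerical statements about the contours (67) are easily verified. The following sketch reads the left-hand side of (67) as $\{(x_1-2\gamma)^2 + 2(x_1-2\gamma)x_2 + (1+\gamma^2)x_2^2\}/\gamma^2$; the function name is illustrative.

```python
import math

CHI2 = 5.991   # value of chi-squared used in (67)

def quadratic_form(x1, x2, gamma):
    """Left-hand side of (67)."""
    return ((x1 - 2 * gamma) ** 2 + 2 * (x1 - 2 * gamma) * x2
            + (1 + gamma ** 2) * x2 ** 2) / gamma ** 2

r = math.sqrt(CHI2 - 4.0)
common_points = [(-r, r), (r, -r)]          # the points x_2 = -x_1 = +/- sqrt(chi^2 - 4)

for gamma in (0.5, 1.0, 2.0):
    # Each point lies on the ellipse for every one of the three hypotheses.
    print(gamma, [quadratic_form(x1, x2, gamma) for x1, x2 in common_points])

print(round(r, 2))                          # coordinate of the common points
print(1.0 - math.exp(-CHI2 / 2.0))          # probability content of the ellipse
```

On the line $x_1 = -x_2$ the first three terms of the quadratic form collapse to $4\gamma^2$, which is why the result is free of $\gamma$.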

Suppose that the hypothesis tested is that $\gamma = 1.0$; then the best critical regions
with regard to any alternative will be bounded by some line or lines, $\bar{x}$ = constant,
which are perpendiculars to $AOB$, i.e. to $x_1 = x_2$. Such best critical regions have
been indicated by shading:

(1) For the alternative $\gamma = 2.0 > \gamma_0$. Here the inequality (57) must be used, and
the boundaries will be two straight lines such as $EE'$ and $GG'$ equidistant from
the point $C_1$ on $AOB$. Since from (65) the constant $c_1 = \frac{2}{3}$, in the diagram

    $OC_1 = \sqrt{2}\, c_1 = \frac{2}{3}\sqrt{2}$.

The distance between the boundaries or $2\sqrt{2}\, d$ will depend on the value of $\alpha$
chosen, i.e. upon the risk accepted of rejecting $H_0$ when it is true.

(2) For the alternative $\gamma = 0.5 < \gamma_0$. Here we must use (58) and the best critical
region will lie between two straight lines such as $JJ'$ and $LL'$, equidistant from
the point $C_2$ on $AOB$. From (65), $c_2 = \frac{1}{3}$ so that $OC_2 = \sqrt{2}\, c_2 = \frac{1}{3}\sqrt{2}$, while again
the distance between the boundaries will depend upon $\alpha$.

Since the length $\sqrt{2}\, c = \sqrt{2}\gamma/(1 + \gamma)$ determining the position of the points
$C_1, C_2, \ldots$ etc. is different for every alternative, the best critical regions will be
different in every case, although the boundaries of these regions all belong to the
family of straight lines, $\bar{x}$ = constant. The region $w$ appropriate for an alternative
$\gamma = \gamma_1$ is picked out so that $P\{E \in w \mid (\gamma = \gamma_1)\}$ is a maximum subject to

    $P\{E \in w \mid (\gamma = \gamma_0)\} = \alpha$.
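
The positions of the centres quoted above follow from (65) by simple arithmetic. The sketch below reads (65) as $c = \gamma\gamma_0/(\gamma + \gamma_0)$ and uses exact fractions; the function name is illustrative.

```python
from fractions import Fraction

def centre(gamma, gamma0):
    """c of (65) for the alternative gamma when gamma0 is the value tested."""
    return gamma * gamma0 / (gamma + gamma0)

gamma0 = Fraction(1)                     # hypothesis tested, gamma = 1.0
c1 = centre(Fraction(2), gamma0)         # alternative gamma = 2.0
c2 = centre(Fraction(1, 2), gamma0)      # alternative gamma = 0.5
print(c1, c2)                            # distances OC are sqrt(2) times these
```

Any two different alternatives give different values of $c$, since $\gamma/(1+\gamma)$ increases steadily with $\gamma$.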

We may also interpret the sufficient statistic in this simple case. If we take new
rectangular axes $OB$, $OD$ with coordinates $u = (x_1 + x_2)/\sqrt{2}$, $v = (x_2 - x_1)/\sqrt{2}$, then
the probability law for $u$ and $v$ becomes

    $p(u, v \mid \gamma) = \frac{1}{\gamma\sqrt{\pi}}\, e^{-\frac{(u - \sqrt{2}\gamma)^2}{\gamma^2}} \times \frac{1}{2\sqrt{\pi}}\, e^{-\frac{1}{4}(u + v)^2}$.    (68)

Thus whatever be $\gamma$, for a fixed $u = \sqrt{2}\,\bar{x}$, $v$ is distributed normally with a
standard deviation of $\sqrt{2}$ about a mean of $-u$. When $u$ (or $\bar{x}$) is known, it follows
that a knowledge of $v$ provides no additional information whatsoever concerning
$\gamma$. In this sense $\bar{x}$ is a specific "sufficient" statistic of $\gamma$.*
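
The change of axes can be checked numerically. The sketch assumes the reading of (66) given by putting $n = 2$, $\sigma = \gamma$ in (52), and the reading of (68) as the product of a normal law for $u$ about $\sqrt{2}\gamma$ with standard deviation $\gamma/\sqrt{2}$ and a normal law for $v$ about $-u$ with standard deviation $\sqrt{2}$; the sample points are hypothetical.

```python
import math

def law_x(x1, x2, gamma):
    """Probability law (66) in the original coordinates."""
    q = (x1 - 2 * gamma) ** 2 + 2 * (x1 - 2 * gamma) * x2 + (1 + gamma ** 2) * x2 ** 2
    return math.exp(-q / (2 * gamma ** 2)) / (2 * math.pi * gamma)

def law_uv(u, v, gamma):
    """Probability law (68) in the rotated coordinates."""
    part_u = math.exp(-((u - math.sqrt(2) * gamma) ** 2) / gamma ** 2) / (gamma * math.sqrt(math.pi))
    part_v = math.exp(-((u + v) ** 2) / 4) / (2 * math.sqrt(math.pi))
    return part_u * part_v

def rotate(x1, x2):
    """u = (x1 + x2)/sqrt(2), v = (x2 - x1)/sqrt(2); the Jacobian is unity."""
    return (x1 + x2) / math.sqrt(2), (x2 - x1) / math.sqrt(2)

for gamma in (0.5, 1.0, 2.0):
    for x1, x2 in ((0.4, -0.3), (2.0, 1.0), (-1.0, 0.5)):
        u, v = rotate(x1, x2)
        print(law_x(x1, x2, gamma), law_uv(u, v, gamma))
```

The second factor of (68) does not contain $\gamma$, which is the formal expression of the remark that $v$ adds nothing once $u$ is known.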


Example V. On p. 120 of the present paper we gave a definition of a sufficient
set of statistics with regard to parameters $\theta_1, \ldots \theta_l$. It is not proposed to discuss
here the properties of such a set, but we shall indicate briefly with an example
that such sets of statistics do exist. If each of $n$ independent random variables $x_i$
follows the normal probability law

    $p(x_i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x_i - \gamma)^2}$,    (69)

* The sufficient estimate of $\gamma$ obtained from the maximum likelihood solution is found to be
$\hat{\gamma} = 2\bar{x}(\sqrt{2} - 1)$ if $\bar{x} > 0$, and $\hat{\gamma} = -2\bar{x}(1 + \sqrt{2})$ if $\bar{x} < 0$.


then the joint probability law of $x_1, \ldots x_n$ may be written

    $p(x_1, \ldots x_n) = \frac{1}{\sigma^n(\sqrt{2\pi})^n}\, e^{-\frac{n}{2\sigma^2}\{(\bar{x} - \gamma)^2 + s^2\}}$,    (70)

where

    $n\bar{x} = \sum (x_i)$ and $ns^2 = \sum (x_i - \bar{x})^2$.    (71)

Then it follows that

(1) $\bar{x}$ is a specific sufficient statistic of $\gamma$.

(2) There is no specific sufficient statistic of $\sigma$ nor a shared sufficient statistic
of $\gamma$ and $\sigma$.

(3) $\bar{x}$ and $s$ form a set of sufficient statistics with regard to $\gamma$ and $\sigma$.

(4) If $\gamma$ is known, then $\sum (x_i - \gamma)^2$ is a specific sufficient statistic of $\sigma$.

Other examples of sufficient systems of statistics are mentioned on p. 131.
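
The form (70) rests on the identity $\sum (x_i - \gamma)^2 = n(\bar{x} - \gamma)^2 + ns^2$, and may be compared with the direct product of the laws (69). The sketch below is an illustration only; the sample values and parameters are hypothetical.

```python
import math

def joint_direct(xs, gamma, sigma):
    """Product of the n normal laws (69)."""
    return math.prod(
        math.exp(-((x - gamma) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        for x in xs
    )

def joint_from_statistics(xbar, s2, n, gamma, sigma):
    """The form (70), which involves the sample only through xbar and s^2."""
    return math.exp(-n * ((xbar - gamma) ** 2 + s2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    ) ** n

xs = (1.2, 0.4, 2.9, 1.7, 0.8)                    # hypothetical sample
n = len(xs)
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n         # s^2 as defined in (71)

for gamma, sigma in ((1.0, 1.0), (0.0, 2.0), (2.5, 0.6)):
    print(joint_direct(xs, gamma, sigma), joint_from_statistics(xbar, s2, n, gamma, sigma))
```

Two samples agreeing in both $\bar{x}$ and $s$ therefore have the same probability law whatever $\gamma$ and $\sigma$, which is the sense of statement (3).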


8. Conclusions. 

The theory and examples discussed have shown that 


(a) When a system of uniformly most powerful tests exists and some other 
additional conditions are satisfied, then sufficient statistics either specific or 
shared must also exist. 


(b) When a system of uniformly most powerful tests exists there may be no
unique sufficient statistic at all. 


(c) When a sufficient statistic exists, a system of uniformly most powerful tests 
may exist or not. 


We cannot therefore agree with the opinion expressed by R. A. Fisher* that 
the problem of uniformly most powerful tests is, as it were, covered by that of 
sufficient statistics. Neither do we think that in cases where no sufficient statistics 
exist his own method of approach, by what he has termed the theory of estima- 
tion, is the most convenient or simple one to follow. Certainly there are links 
between these two kinds of problem, which are very interesting, but no more. 

In our opinion the problems of testing statistical hypotheses should be treated 
by starting directly from some comprehensible principle expressed, if possible, in 
terms of existing concepts, such as that of probability. The concept which seems 
reasonable to us is this: arrange your test so as to minimize the probability of 
errors. Since the errors involved in testing hypotheses are of two kinds, the 
problem requires further specification which may have different forms. One of 
these has led to the theory of uniformly most powerful tests. Since it has been 
found that in many cases a solution along these lines is impossible, it follows that 
in such cases the problem of minimizing the probability of errors must be specified 


* Proc. Roy. Soc. A, vol. CXLIV (1934), p. 296.




in a different form. Our further attempt in this direction has led us to a con- 
ception of unbiassed critical regions, the theory of which is being published partly 
in another paper by the authors in the present issue and partly by Neyman, 
elsewhere.* 

Before concluding we may remark that the practical statistician may perhaps 
complain that certain of our examples, e.g. III and IV, are somewhat artificial. 
With this we agree. We think, however, that they serve their purpose, which is to 
show that the premise: “a sufficient statistic exists”, does not necessarily involve
the conclusion: “a uniformly most powerful test exists”. There is however
nothing artificial or exceptional in other examples, I, Ia and II. 


* Bull. Soc. Math. de France, t. 63 (1935), p. 246-266. 


TESTS OF STATISTICAL HYPOTHESES IN THE CASE 
WHEN THE SET OF ALTERNATIVES IS DISCONTINUOUS, 
ILLUSTRATED ON SOME GENETICAL PROBLEMS 


By ROBERT W. B. JACKSON 


CONTENTS

1. Introductory
2. Statement of the genetical problem considered
3. Application of the likelihood principle
4. Criticisms of the test derived; the conception of stringency and of the most
   stringent test
5. Comparison of the most stringent test with the Y test and with the t test derived
   from the principle of likelihood
6. Some data facilitating the use of the most stringent test in the
   problems considered
7. Summary of results
8. References


1. Introductory. 

The theory of testing statistical hypotheses as derived hitherto applies mainly,
if not entirely, to the case where the set of admissible hypotheses which are
alternative to that tested is continuous. In order to explain the meaning of this
statement I shall need the following conceptions. Denote by $H_0$ a simple hypothesis
to be tested, by $\Omega$ the set of simple hypotheses which are considered as admissible,
by $E$ the so-called sample point the coordinates of which, $x_1, x_2, \ldots x_n$, may be
given by observation, by $W$ the sample space or the space the points of which are
the possible positions of the sample point, and finally by $w$ any region in $W$, which
may be chosen as a critical region for testing the hypothesis $H_0$. Further, $P\{A \mid H\}$
will denote the probability of any event $A$ as determined by some admissible
hypothesis, $H$, and in particular $P\{E \in w \mid H\}$ the probability determined by $H$
that the sample point will fall within the region $w$. This is the usual notation and
terminology, the definitions of which may be found in publications on the theory
of testing statistical hypotheses (1, 2, 3).*

I may now explain what I mean by saying that the set of admissible hypotheses,
$\Omega$, is continuous. The set $\Omega$ is called continuous if, whatever the hypothesis $H_0$
belonging to $\Omega$, whatever the region $w$ in the sample space and whatever $\alpha > 0$, it
is possible to find within $\Omega$ another hypothesis $H_1$ different from $H_0$ such that the
difference

    $| P\{E \in w \mid H_1\} - P\{E \in w \mid H_0\} | < \alpha$.    (1)


* Figures in small type throughout the present paper refer to the publications, the list of which 
will be found at the end of the present paper. 


ROBERT W. B. JACKSON 139


As already stated the theory of testing hypotheses has been built up primarily
for cases where the set $\Omega$ is continuous and the cases of discontinuity have been
mentioned only once (3). On the other hand, it would be easy to quote examples of
practical problems in which the set of alternatives $\Omega$ is not continuous. Such are
all cases in which the elementary probability law of the variables $x_1, x_2, \ldots x_n$,
the values of which could be given by observation, is a function of some
parameters which vary discontinuously, for example, are able to take only integer
values.

An obvious example of this kind is provided by the problem concerning the 
number, n, of pairs of genes affecting some detectable and perhaps measurable 
character of organisms. This problem will be treated below in some detail. It 
will be seen that certain conclusions and conceptions derived in this connection 
have a wider range of application. 

The problem of the number of pairs of genes as treated below may seem to
many geneticists to be very much simplified. This certainly is so. In practical 
problems in genetics, as well as in other branches of empirical science, we can but 
rarely draw a strict limit between what is exactly known and what is not. The 
geneticist often starts with the assumption that the genes he is dealing with are 
not linked and proceeds to draw some conclusions from this assumption and to 
compare them with his observations. But if he is asked whether he is certain that 
any hypothesis assuming linkage is in his case inadmissible, he will frequently 
hasten to state the contrary, adding that his calculations were only tentative, 
and that the assumption of the independence of the genes is no more than a 
“working hypothesis”, which may be dismissed at once.

Now in what follows, as is usual in mathematics, a clear distinction is made 
between what is known and what is not. In particular, it is assumed as known 
that there is no linkage between the genes considered and that the only circum- 
stance concerning them which is doubtful is the number of their pairs. Just this 
point may seem to be a too far-reaching simplification of the situation. However, 
it will be noticed that these assumptions are no more and no less than the “work- 
ing hypothesis” mentioned above and that the theory which follows may be 
useful to the geneticist in those stages, however short they may be, when he is 
dealing with his simplest working hypothesis described. 

From the point of view of mathematical statistics the following pages may be 
interesting as they show certain, perhaps unexpected, advantages and possi- 
bilities in the situation when the set of alternative hypotheses is not continuous. 


2. Statement of the genetical problem considered. 

We shall consider the case where a certain somatic (phenotypic) character, C, 
of an organism depends upon an unknown number of pairs of genes, n. The genes 
in question will be denoted by 


$$A_1a_1,\ A_2a_2,\ \ldots,\ A_na_n, \tag{2}$$


140 Testing Statistical Hypotheses 


where the capital A’s denote the dominant and small a’s the recessive genes. It 
will be assumed as known that all the genes are inherited independently (i.e. that 
there is no linkage), and that the only genetical composition which it is possible 
to identify is that of the pure recessives, i.e. the individuals with the composition 


$$\{a_1a_1,\ a_2a_2,\ \ldots,\ a_na_n\}. \tag{3}$$


For example, we may assume that the character considered, C, consists in a 
special colouring of the organism (or of some part of the same), the individuals 
of type (3) being white, say, while those of any other type, e.g. 


$$\{A_1a_1,\ a_2a_2,\ \ldots,\ a_na_n\}, \tag{4}$$
$$\{A_1a_1,\ A_2a_2,\ \ldots,\ a_na_n\}, \tag{5}$$
$$\{A_1A_1,\ A_2a_2,\ \ldots,\ A_nA_n\}, \tag{6}$$


are coloured in some way, but that it is very difficult, if not impossible, to dis- 
tinguish the particular types represented by the colouring. 

Assume further that there exists a hypothesis, say $H_0$, ascribing to the unknown
number, $n$, of pairs of genes influencing the character, C, a certain value,
say $n = n_0$, and that it is desired to test this hypothesis. Suitable experimental
data may be obtained in either of two ways. First we have to obtain the hybrids 
with regard to all n pairs of genes. For this purpose we may cross a pure dominant 
and a pure recessive, i.e. 


$$\{A_1A_1,\ A_2A_2,\ \ldots,\ A_nA_n\} \times \{a_1a_1,\ a_2a_2,\ \ldots,\ a_na_n\} = \{A_1a_1,\ \ldots,\ A_na_n\}. \tag{7}$$


This step is common to the two different methods of obtaining experimental
data for testing $H_0$. They differ in the next step, which consists either (a) in crossing
the hybrids among themselves (brother-sister matings) and in obtaining the
$F_2$ generation which may contain all types of individuals from pure dominants
to pure recessives, including all intermediates, or (b) in making a back-cross


$$\{A_1a_1,\ A_2a_2,\ \ldots,\ A_na_n\} \times \{a_1a_1,\ a_2a_2,\ \ldots,\ a_na_n\} \tag{8}$$


and obtaining a generation, say $F_2'$, which may include individuals varying from
hybrids with regard to all $n$ pairs of genes to pure recessives. According to the
assumption made previously, all the individuals of either $F_2$ or $F_2'$ may be
classified only into two categories: pure recessives like (3) and the remainder.
Denote by $M$ the number of individuals forming either of the generations $F_2$ or
$F_2'$ and by $m$ the number of pure recessives among them.

Statistically or mathematically the problem of testing the hypothesis $H_0$,
using the generation $F_2$ or $F_2'$, is exactly the same. However, the numerical
results must be different. For this reason the theoretical discussion of the problem
will be given only once and will apply to the case where the experimental data
concern the generation $F_2$. On the other hand, certain numerical results will be
given separately for the two cases.


3. Application of the likelihood principle. 

In some early publications (1) Neyman and Pearson suggested a principle from
which tests of statistical hypotheses should be derived. This principle will be
called the likelihood principle. An analysis of various tests in common use showed
that generally we are inclined (on intuitive grounds) to reject a hypothesis tested,
say $H_0$, when among the admissible alternative hypotheses there are some which
ascribe to the observed facts a probability much larger than that ascribed by the
hypothesis tested, $H_0$. Therefore, if the hypothesis tested, $H_0$, is a simple hypothesis,
i.e. completely specifies the probability, $P\{A \mid H_0\}$, of the facts observed,
$A$, and if $P_{\max}\{A\}$ denotes the upper bound of the probabilities of the same facts
corresponding to all the admissible alternative hypotheses, then the ratio

$$\lambda = \frac{P\{A \mid H_0\}}{P_{\max}\{A\}} \tag{9}$$

is a suitable criterion to use for testing the hypothesis $H_0$. The ratio $\lambda$ is called the
likelihood ratio. If, in any given case, it is small, then we are intuitively inclined
to reject the hypothesis tested. If $\lambda$ is not very small, then there seems to be no
special reason for the rejection of $H_0$.

In order to distinguish between the values of $\lambda$ which are “small” and those
which are not we must take into consideration the risk of an erroneous rejection
of the hypothesis tested. Such an error is called an error of the “first kind”.

Assume that we have decided to reject $H_0$ whenever we have $\lambda < \lambda_0$, $\lambda_0$ being
some fixed number, $0 < \lambda_0 \le 1$. If the hypothesis tested, $H_0$, happens to be true,
then the probability of an erroneous rejection of $H_0$ will be identical with that of
getting $\lambda < \lambda_0$, or $P\{\lambda < \lambda_0 \mid H_0\}$. If the probability law of $\lambda$ is known, then we may
fix in advance any number $\varepsilon > 0$, and then find a value $\lambda_\varepsilon$ (as large as possible)
such that

$$P\{\lambda < \lambda_\varepsilon \mid H_0\} \le \varepsilon. \tag{10}$$

If we make a rule of rejecting $H_0$ when, and only when, $\lambda < \lambda_\varepsilon$, we may be certain
that we shall be wrong in doing so only in a proportion of cases not exceeding $\varepsilon$,
where the value of $\varepsilon$ can be chosen according to the risk of error we are prepared
to accept. Besides, in using the rule based on $\lambda$, we shall reject $H_0$ only when
among the admissible alternative hypotheses there exist some which make the
observed event much more probable than does the hypothesis tested.

At a later stage (2, 3) Neyman and Pearson suggested a more rational approach
to the problem of testing hypotheses, based entirely on the probabilities of errors 
occurring in the rejection and non-rejection of a hypothesis. However, this method 
of approach is somewhat more elaborate than the application of the principle of 
likelihood and, until the present time, all tests derived from the principle of 
likelihood proved satisfactory when analysed from the point of view of the 
probabilities of errors. Therefore it seemed natural to start by deriving a test of 
the particular genetical hypothesis considered above by applying the likelihood 




principle. It will be seen, however, that in this particular case the test derived is 
not satisfactory. 

To apply the likelihood principle we must specify the hypotheses which we
shall consider as admissible. I am not aware of any method of determining the
maximum value of the number of pairs of genes, $n$, influencing any character, C.
Therefore I shall consider as admissible any hypotheses ascribing to $n$ any positive
integer value: $n = 1, 2, 3, \ldots$. The hypothesis tested, $H_0$, specifies one of these
values, say $n = n_0$.

If we obtained an $F_2$ generation of $M$ individuals, of which $m$ were pure recessives,
then the probability of this event, say $A_m$, corresponding to any fixed
value of $n$, is easily found to be

$$P\{m \mid n\} = \frac{M!}{m!\,(M-m)!} \left(\frac{1}{4^n}\right)^{m} \left(1 - \frac{1}{4^n}\right)^{M-m}. \tag{11}$$


Substituting $n = n_0$ in (11) we shall obtain the probability determined by the
hypothesis tested (i.e. the numerator of (9)). In order to calculate the denominator
of (9) we shall have to calculate the maximum value of the probability (11)
considered as a function of $n$, where $m$ and $M$ will be considered as constant.
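As a small numerical aid, formula (11) can be evaluated directly. The sketch below is only an illustration; the function name `prob_recessives` is not from the paper, and the chance $4^{-n}$ of a pure recessive in the $F_2$ generation is taken from (11).

```python
from math import comb

def prob_recessives(m: int, M: int, n: int) -> float:
    """P{m | n} of formula (11): m pure recessives among M F2 individuals."""
    p = 4.0 ** (-n)  # chance that a single F2 individual is a pure recessive
    return comb(M, m) * p**m * (1.0 - p) ** (M - m)

# Probability of observing no pure recessive at all when n = 3 and M = 250.
p_zero = prob_recessives(0, 250, 3)
```

For fixed $m$ and $M$ the same function can be tabulated over $n = 1, 2, 3, \ldots$ to see where the probability is largest.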

As $n$ can take only integer values, the usual method of differentiation cannot be
applied. We shall use the following method:

Define

$$Q_n = \frac{P\{m \mid n\}}{P\{m \mid n-1\}} = \left(\frac{1}{4}\right)^{M} \left(4 + \frac{3}{4^{n-1}-1}\right)^{M-m} \quad \text{for } n = 2, 3, \ldots. \tag{12}$$

It will be seen that $Q_n$ is a decreasing function of $n$, tending to $4^{-m} < 1$ when $n$
is increased without limit.
For $m > 0$, the two following cases may arise:

(a) $Q_2 < 1$. Then all the other $Q$'s are less than unity, so that we shall have

$$P\{m \mid 1\} > P\{m \mid 2\} \quad \text{and} \quad P\{m \mid n-1\} > P\{m \mid n\} \quad \text{for } n > 2. \tag{13}$$

Thus the value of $n$ corresponding to the largest probability of $A_m$ is always unity.

(b) $Q_2 \ge 1$. The series of numbers $Q$ will have the property that

$$Q_2 > Q_3 > \ldots > Q_{n'} \ge 1 > Q_{n'+1} > \ldots, \tag{14}$$

and it follows that

$$P\{m \mid 1\} < P\{m \mid 2\} < \ldots < P\{m \mid n'-1\} \le P\{m \mid n'\} > P\{m \mid n'+1\} > \ldots. \tag{15}$$

In other words, in this case the value of $n$ corresponding to the largest probability
of $A_m$ will be $n'$, for which

$$Q_{n'} \ge 1 > Q_{n'+1}. \tag{16}$$

If $Q_{n'} = 1$, then there will be another value of $n$ having the same property,
namely $n = n' - 1$.




It is convenient to solve the inequalities (16) with regard to $m$, assuming that
$n'$ has a fixed value. We obtain the inequalities

$$1 - \frac{\log 4}{\log\left(4 + \dfrac{3}{4^{n'}-1}\right)} < \frac{m}{M} \le 1 - \frac{\log 4}{\log\left(4 + \dfrac{3}{4^{n'-1}-1}\right)}, \tag{17}$$

determining the range of $m/M$ within which the value of $n$ corresponding to the
largest probability of $A_m$ is the same, namely $n = n'$.

If $n' = 1$, only the left-hand side of (17) has any meaning, and we see that $n' = 1$
maximizes $P\{m \mid n\}$ whenever, say,

$$\frac{m}{M} > 1 - \frac{\log 4}{\log 5} = a_1. \tag{18}$$

Further, $n' = 2$ is the value of $n$ maximizing $P\{m \mid n\}$ whenever, say,

$$a_2 = 1 - \frac{\log 4}{\log 4.2} < \frac{m}{M} \le a_1. \tag{19}$$

In general, for $n' > 1$, $n'$ will be the value of $n$ maximizing $P\{m \mid n\}$ whenever, say,

$$a_{n'} = 1 - \frac{\log 4}{\log\left(4 + \dfrac{3}{4^{n'}-1}\right)} < \frac{m}{M} \le a_{n'-1}. \tag{20}$$

A direct calculation gives the following values:

$$a_1 = .138647,\quad a_2 = .033998,\quad a_3 = .008465,\quad a_4 = .002114,\quad a_5 = .000528,\quad a_6 = .000132. \tag{21}$$
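The cut-points listed in (21) can be recomputed, and the rule they express checked against a direct search over $n$. The following Python sketch is offered only as a check under the assumption that $a_n = 1 - \log 4 / \log\{4 + 3/(4^n - 1)\}$, which is the form consistent with (18) and (19); the names `a` and `most_likely_n` and the search limit `n_max` are illustrative.

```python
import math

def a(n: int) -> float:
    """Cut-point a_n = 1 - log 4 / log(4 + 3 / (4**n - 1)), for n >= 1."""
    return 1.0 - math.log(4.0) / math.log(4.0 + 3.0 / (4.0**n - 1.0))

def most_likely_n(m: int, M: int, n_max: int = 50) -> int:
    """Value of n in 1..n_max maximising P{m | n}; intended for m > 0."""
    def log_p(n: int) -> float:
        p = 4.0 ** (-n)
        # The binomial coefficient does not depend on n and is omitted.
        return m * math.log(p) + (M - m) * math.log1p(-p)
    return max(range(1, n_max + 1), key=log_p)

cut_points = [a(n) for n in range(1, 7)]
```

With $M = 250$, for instance, `most_likely_n(5, 250)` falls between the cut-points $a_3$ and $a_2$ and so gives $n = 3$. The recomputed $a_3$ agrees with the printed value only to within about one unit in the sixth decimal place.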


Turning to formula (9) for $\lambda$ we see that it could be written in the form

$$\lambda = \frac{P\{m \mid n_0\}}{P\{m \mid n'\}} = \left(\frac{1}{4}\right)^{(n_0 - n')M} \left(\frac{4^{n_0}-1}{4^{n'}-1}\right)^{M-m} \tag{22}$$

$$\text{for} \quad a_{n'} < \frac{m}{M} \le a_{n'-1}, \tag{23}$$

where $\lambda$ means the likelihood ratio for the hypothesis that $n = n_0$.

For all $m$ satisfying the inequalities (23) the probability of $A_m$ in the denominator
of (22) will be determined by the same value of $n$, namely $n = n'$. It follows
that if $m$ falls within the limits

$$a_{n_0} < \frac{m}{M} \le a_{n_0-1}, \tag{24}$$

then the denominator of (22) will be exactly equal to the numerator, and thus
$\lambda = 1$ for all $m$'s satisfying (24). Of course outside the range determined by (24)
the value of $\lambda$ will decrease.
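The behaviour of the likelihood ratio described above can be reproduced numerically. The sketch below does not use the closed form (22); instead it maximises $P\{m \mid n\}$ over $n = 1, \ldots, n_{\max}$ by direct search, where the cap `n_max` is an assumption made only to keep the search finite. The function names are illustrative.

```python
import math

def log_p(m: int, M: int, n: int) -> float:
    """log P{m | n} without the binomial coefficient, which cancels in the ratio."""
    p = 4.0 ** (-n)
    return m * math.log(p) + (M - m) * math.log1p(-p)

def likelihood_ratio(m: int, M: int, n0: int, n_max: int = 60) -> float:
    """lambda = P{m | n0} / max over n of P{m | n}, with n = 1, ..., n_max."""
    best = max(log_p(m, M, n) for n in range(1, n_max + 1))
    return math.exp(log_p(m, M, n0) - best)

# Likelihood ratio for n0 = 3, M = 250 and m = 0, 1, ..., 12.
lam = [likelihood_ratio(m, 250, 3) for m in range(13)]
```

For $n_0 = 3$ and $M = 250$ the ratio equals unity for $m = 3, \ldots, 8$, the integers allowed by (24), and falls off on either side, being roughly .210 at $m = 1$, .850 at $m = 2$ and .488 at $m = 9$.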




Fig. 1 gives the graph of $\lambda$ considered as a function of $m$, and the probability
law of $m$, for $n_0 = 3$ and $M = 250$.

[Fig. 1. Graph of $\lambda$ and probability law of $m$, for $n_0 = 3$, $M = 250$. The probability law $p(m \mid n = 3)$ and the graph of $\lambda$ are plotted against the scale of $m$.]

In applying the likelihood principle to derive the test of the hypothesis specifying
$n = n_0$, we have to find a value, say $\lambda_0$, such that the probability, determined
by the hypothesis tested, of having as a result of an experiment the value $\lambda < \lambda_0$
should not exceed and be as close as possible to $\varepsilon > 0$, which is chosen in advance.
Then we should make a rule of rejecting the hypothesis tested whenever $\lambda < \lambda_0$.
It is seen that this reduces to the following steps:

(a) Find two integers $l \le r$ such that for any $m < l$ or any $m > r$ we shall have
$\lambda < \lambda_0$, and such that $P\{l \le m \le r \mid n_0\} \ge 1 - \varepsilon$.

(b) Make a rule of rejecting the hypothesis tested whenever $m < l$ or $m > r$.


The set of values of $m$ satisfying

$$l \le m \le r \tag{25}$$

will be called the region of acceptance of the hypothesis tested, and the set of
remaining values of $m$, i.e. $m < l$ or $m > r$, will be called the critical region and will
be denoted by $w$. The hypothesis tested will be rejected whenever $m$ falls within
this critical region $w$. As the critical region is divided into two parts, one including
the values of $m < l$, and the other the values of $m > r$, it will be convenient
to call these two parts the left and the right critical regions and denote them by
$w_l$ and $w_r$, respectively. We shall denote by $\varepsilon_l$ and $\varepsilon_r$ the values of the probability,
determined by the hypothesis tested, that $m$ will fall within $w_l$ and $w_r$
respectively. The sum, $\varepsilon_l + \varepsilon_r = \varepsilon$, will be equal to the probability determined by
the hypothesis tested of its rejection when true. Table I below gives the values
of $l$ and $r$, and $\varepsilon_l$, $\varepsilon_r$, $\varepsilon$ for several values of $\lambda_0$ starting with the largest equal to
unity. The values of $\lambda$ have been calculated directly from formula (22), and are
given in Table II. The $\varepsilon$'s were estimated approximately by using a Poisson
formula, i.e. assuming that approximately

$$P\{m \mid n\} = \frac{e^{-X} X^m}{m!}, \tag{26}$$

where $X = M 4^{-n}$.


TABLE I. Probabilities of error corresponding to the test based on the principle
of likelihood. Hypothesis tested: $n_0 = 3$. Size of sample: $M = 250$.


TABLE II. Values of $\lambda$ as a function of $m$. Hypothesis
tested: $n_0 = 3$. Size of sample: $M = 250$.


.0195 | .210


From Tables I and II we see that if we fix $\lambda_0 = 1$, and thus make a rule of rejecting
the hypothesis tested whenever $\lambda < 1$, we shall have to reject whenever $m < l = 3$
or $m > r = 8$. If the hypothesis tested, that $n = n_0 = 3$, be true, then the probability
of its being rejected unjustly will be identical with that of obtaining either
$m < 3$ or $m > 8$, i.e. $\varepsilon = .2709$. This is a rather large risk, and probably we should
like to have a test involving a smaller one. Looking through the column of values
of $\varepsilon = \varepsilon_l + \varepsilon_r$ we see that a reasonably small probability of error of rejection is
reached when we fix $\lambda_0 = .210$. The corresponding critical region will be determined
by the inequalities $m < l = 1$, $m > r = 9$, and the probability of the first-kind
error determined by the hypothesis tested will be $\varepsilon = .0271$.
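The two first-kind error probabilities quoted above can be recovered from the Poisson formula (26). The sketch below is an illustration only; `first_kind_error` is a hypothetical helper name, and it returns the left and right parts of the error separately.

```python
import math

def first_kind_error(l: int, r: int, n0: int, M: int):
    """Return (eps_left, eps_right) for the region l <= m <= r, using the
    Poisson approximation (26) with X = M * 4**(-n0)."""
    x = M * 4.0 ** (-n0)

    def pmf(k: int) -> float:
        return math.exp(-x) * x**k / math.factorial(k)

    eps_left = sum(pmf(k) for k in range(0, l))               # P{m < l}
    eps_right = 1.0 - sum(pmf(k) for k in range(0, r + 1))    # P{m > r}
    return eps_left, eps_right

eps_wide = sum(first_kind_error(3, 8, n0=3, M=250))    # region for lambda_0 = 1
eps_small = sum(first_kind_error(1, 9, n0=3, M=250))   # region for lambda_0 = .210
```

Under these assumptions the two sums come out close to .2709 and .0271 respectively.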






4. Criticisms of the test derived; the conceptions of stringency and of the most
stringent test.

Let us now consider the consequences of frequent applications of the test
derived from the likelihood principle. We must remember that the errors in
testing hypotheses consist not only in the wrong rejection of the hypothesis
tested (first-kind error) but also in a failure to reject the hypothesis tested when
it is in fact false. The latter has been termed an error of the second kind, and this
is committed when not $H_0$, but some alternative, say $H_i$, is true and the observed
value $m$ falls within the region of acceptance of $H_0$, i.e. $l \le m \le r$. The position is
illustrated on Fig. 2, which gives the graph of $\lambda$ for $n_0 = 3$ and the probability
laws of $m$ (for $M = 250$) determined by the hypotheses:

$H_1$ assuming that $n = n_1 = 1$,
$H_2$ assuming that $n = n_2 = 2$,
$H_0$ assuming that $n = n_0 = 3$,
$H_4$ assuming that $n = n_4 = 4$,
$H_5$ assuming that $n = n_5 = 5$.

[Fig. 2. The probability law $p(m \mid n) = \frac{M!}{m!\,(M-m)!} \left(\frac{1}{4^n}\right)^m \left[1 - \left(\frac{1}{4^n}\right)\right]^{M-m}$ and the graph of $\lambda$, plotted against the scale of $m$.]

It can be seen that if $H_1$ is true then the value of $m$ is almost certain to fall within
the “right” critical region and therefore we may be practically certain that no
second-kind errors will be committed.




This will not be the case when the true hypothesis is $H_2$. In fact, if we choose
to apply the critical region determined by $\lambda_0 = .210$ and thus $l = 1$ and $r = 9$, the
frequency of $m$ falling within the region of acceptance $1 \le m \le 9$, when $H_2$ is true,
will not be negligible. This probability has been calculated and its value is given
in Table I in the column marked $\eta_r$. Thus $\eta_r$ is in each case equal to the probability
of getting the observed value of $m$ within the region of acceptance calculated
under the assumption that the true hypothesis is $n = n_0 - 1$, not the hypothesis
tested $n = n_0$. Since we accept the hypothesis tested, $H_0$, whenever $m$ falls within
the region of acceptance, $\eta_r$ is equal to the probability of second-kind errors
calculated for the cases when the true hypothesis is $H_{n_0-1}$; we shall call it $\eta_r$
($\eta$-right), for short. Its value corresponding to $l = 1$ and $r = 9$ is .0520, which is
not a very large number.

Let us now consider the consequences of our applying the test with $\lambda_0 = .210$
when the true value of $n$ is larger than $n_0 = 3$. If the true hypothesis is $H_4$, then
the distribution of $m$ will correspond to the graph marked $n = 4$, and it is seen at
once that a very large part of this will fall within the region of acceptance. The
value of the probability determined by $H_4$ of $m$ falling within the region of
acceptance, or in other words the value of the probability of second-kind errors
denoted by $\eta_l$ (i.e. $\eta$-left), is given in Table I and is equal to .6312. It follows that
in cases where the true value of $n$ is $n = n_0 + 1 = 4$, in applying our test we shall
fail to reject the hypothesis tested in over 63 per cent. of all cases! The probability
of second-kind errors corresponding to the cases when the true value of
$n$ is equal to $n = n_0 + 2 = 5$ is smaller, but still considerable. This could hardly be
considered as satisfactory, and it seems reasonable to inquire whether, by altering
the region of acceptance, it would not be possible to diminish the probabilities
of error in applying the test.

Until the present moment we considered the probabilities of first- and second-kind
errors as determined by the particular hypothesis under consideration. Let
us now consider and denote by $P\{l, r\}$ the probability of an error (of any kind) in
testing the hypothesis $H_0$ by applying the region of acceptance $l \le m \le r$.

In order to calculate $P\{l, r\}$ we need the a priori probability that any particular
admissible hypothesis is true. Let $\phi_i$ denote this probability for the hypothesis
$H_i$ ($i = 1, 2, \ldots$) and $\phi_0$ for $H_0$.

The probability $P\{l, r\}$ is equal to the sum of the probabilities of first- and
second-kind errors. The probability of the first-kind errors is equal to the product
$\phi_0 \varepsilon$ of the probability of the hypothesis tested being true, and of the probability $\varepsilon$,
determined by this hypothesis, that $m$ falls within the critical region. The probability
of a second-kind error is equal to the sum of products of $\phi_i$ and of the
probabilities, say $\eta_i$, determined by the corresponding alternative $H_i$ of $m$ falling
within the region of acceptance; this expression is equal to $\sum_{i \ne 0} \phi_i \eta_i$. Therefore

$$P\{l, r\} = \phi_0 \varepsilon + \sum_{i \ne 0} \phi_i \eta_i. \tag{27}$$






It may be useful to explain the meaning of the probabilities denoted by $\phi$.
Consider the aggregate of cases in the future in which any hypothesis similar to
the one under consideration, $H_0$, will be tested. $\phi_i$ is the proportion of this
aggregate in which the true value of $n$ is equal to $n_i$. Obviously the values of $\phi$
are unknown, and could not be known. So is the value of $P\{l, r\}$. However, we
may easily calculate the upper limit which $P\{l, r\}$ cannot exceed. Denote by $\alpha$
the largest of the numbers $\varepsilon, \eta_1, \eta_2, \ldots$; then it is seen that we may replace (27)
by the inequality

$$P\{l, r\} \le \alpha \left[\phi_0 + \sum_{i \ne 0} \phi_i\right] = \alpha, \tag{28}$$

as the sum of all the $\phi$'s must be equal to unity, i.e. $\sum \phi_i = 1$.
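The bound (28) may be illustrated numerically. The sketch below takes the error probabilities quoted in the text for the region $1 \le m \le 9$ ($\varepsilon = .0271$, $\eta_r = .0520$, $\eta_l = .6312$) and, purely as an assumption made for illustration, a few hypothetical sets of a priori probabilities concentrated on $n = 2, 3, 4$. The function name `total_error` is not from the paper.

```python
def total_error(phi: dict, eps: float, eta: dict, tested: int) -> float:
    """P{l, r} of formula (27); phi maps each value of n to its a priori probability."""
    return sum(w * (eps if n == tested else eta[n]) for n, w in phi.items())

eps = 0.0271                      # first-kind error quoted for 1 <= m <= 9
eta = {2: 0.0520, 4: 0.6312}      # second-kind errors quoted for n = 2 and n = 4
alpha = max(eps, *eta.values())   # the largest of the error probabilities

# Hypothetical a priori probabilities; each set sums to unity.
priors = [
    {2: 0.1, 3: 0.8, 4: 0.1},
    {2: 0.0, 3: 0.0, 4: 1.0},     # the least favourable case
    {2: 1 / 3, 3: 1 / 3, 4: 1 / 3},
]
errors = [total_error(phi, eps, eta, tested=3) for phi in priors]
```

Whatever the a priori probabilities, the total probability of error stays at or below `alpha`, and it reaches `alpha` only when all the weight falls on the least favourable hypothesis.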


A similar argument has been used by Neyman and Pearson (3). It is interesting 
to note that the problem of testing the genetical hypothesis under consideration 
is less difficult than that considered in their paper. They considered the case 
when the set of alternative hypotheses was continuous, containing hypotheses 
which ascribed to any possible system of variables under consideration probabilities
differing from those ascribed by the hypothesis tested by less than any
positive number chosen in advance. It was found that the upper bound of the
probability of error in testing such hypotheses could not be less than $\frac{1}{2}$. Moreover,
if the probability of the first-kind error should not exceed $\varepsilon < \frac{1}{2}$, then the upper
bound of what we now denote by $P\{l, r\}$ would be $1 - \varepsilon > \frac{1}{2}$, and usually a rather
large number.

Having described these facts Neyman and Pearson suggested that the tests 
to be applied should have the properties of 


(1) controlling the first-kind errors at a fixed probability level, and 


(2) of reducing the probabilities of second-kind errors determined by particular 
alternative hypotheses to as low a level as possible. 


In our case the set of alternative hypotheses is not continuous, which makes it
possible to arrange a test, say $T_0$, based on a consideration of the total probability
of error, having the property that (a) the total probability of error could
not exceed a fixed number $\alpha$, while (b) for any other test, $T$, the total probability
of error could exceed $\alpha$ if the a priori probabilities proved unfavourable. The test
having this property, and having the broadest possible region of acceptance, will
be termed the “most stringent” test. The lower bound of the total probability
of avoiding an error in applying any test will be called its “stringency” and
denoted by $\beta = 1 - \alpha$. This idea of the most stringent test and of the measure of
stringency was suggested by Dr Neyman.

The problem of finding the most stringent test of the hypothesis, $H_0$, consists
in finding two numbers, $l$ and $r$, such that

(1) the upper bound of the probabilities

$$\varepsilon = P\{m < l \mid H_0\} + P\{m > r \mid H_0\}, \quad \text{and} \quad \eta_i = P\{l \le m \le r \mid H_i\} \quad (\text{for any } i \ne 0)$$

has the smallest possible value, $\alpha$,

(2) the difference $r - l$ is a maximum.*

It is obvious that the problem of the most stringent test may be similarly 
stated in any other case where the set of admissible hypotheses is not continuous. 

The problem is greatly simplified by noting that, for any particular region of
acceptance $l \le m \le r$ corresponding to the hypothesis to be tested, $n = n_0$, the
greatest possible value of $\eta_i$ will be either $\eta_r = \eta_{n_0-1}$ or $\eta_l = \eta_{n_0+1}$. Thus to determine
$\alpha$ belonging to any particular region of acceptance we need only consider the
values of

(1) $\varepsilon_l + \varepsilon_r = \varepsilon$,
(2) $\eta_l = \eta_{n_0+1}$,    (29)
(3) $\eta_r = \eta_{n_0-1}$.

All possible regions of acceptance, then, may be divided into the three systems:

(1) $S_0$: defined by $\varepsilon \ge \eta_l$, $\varepsilon \ge \eta_r$,
(2) $S_l$: defined by $\eta_l \ge \varepsilon$, $\eta_l \ge \eta_r$,    (30)
(3) $S_r$: defined by $\eta_r \ge \varepsilon$, $\eta_r \ge \eta_l$.


From these systems we choose the region of acceptance which satisfies the require- 
ments of the most stringent test. 

Because of the discontinuous distribution of $m$, the regions of acceptance
belonging to the systems (30) have only been obtained by a trial and error method
using particular hypotheses and certain values of $M$. It has been found that for
large values of $\beta$ we may ignore the regions determined by $S_r$ because the values
of $\varepsilon_l$ and $\eta_l$ are more important in determining the value of $\alpha$ than those of $\varepsilon_r$ and $\eta_r$.
In fact, we may generally choose $\varepsilon_r$ close to zero, or at least small in comparison
with $\varepsilon_l$. Thus we choose our value of $\alpha$ from $\varepsilon_l$ or $\eta_l$ and choose $r$ so as to make $r - l$
as great as possible, subject to the condition that $\eta_r \le \alpha$ in each case. The method is
illustrated in Tables III and IV, for $n_0 = 3$, $M = 250$, and $n_0 = 2$, $M = 250$, respectively.

TABLE III. Illustrating the process of determining the most stringent test.
Hypothesis tested: $n_0 = 3$. Size of sample: $M = 250$.


The most stringent test corresponds to the region of acceptance
$3 \le m \le 12$, $\alpha = .2524$.
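The trial and error search can also be carried out exhaustively by machine. The following sketch is not the procedure of the paper: it simply examines every region $l \le m \le r$ up to an arbitrary cap `k_max`, uses the Poisson approximation (26) for all hypotheses, and restricts the alternatives to $n = 1, 2, 4, 5, 6$. All three choices, and the function names, are assumptions made for illustration.

```python
import math

def poisson_cdf(x: float, k_max: int) -> list:
    """Return [P{m <= 0}, ..., P{m <= k_max}] for a Poisson law with mean x."""
    term, total, cdf = math.exp(-x), 0.0, []
    for k in range(k_max + 1):
        total += term
        cdf.append(total)
        term *= x / (k + 1)
    return cdf

def most_stringent(n0: int, M: int, alternatives, k_max: int = 60):
    """Search all regions l <= m <= r (0 <= l <= r <= k_max) for the smallest
    alpha = max(eps, eta_i); ties are broken in favour of the largest r - l."""
    cdfs = {n: poisson_cdf(M * 4.0 ** (-n), k_max) for n in [n0, *alternatives]}

    def accept(n: int, l: int, r: int) -> float:
        lower = cdfs[n][l - 1] if l > 0 else 0.0
        return cdfs[n][r] - lower          # P{l <= m <= r | n}

    best = None
    for l in range(k_max + 1):
        for r in range(l, k_max + 1):
            eps = 1.0 - accept(n0, l, r)   # first-kind error
            alpha = max([eps] + [accept(n, l, r) for n in alternatives])
            key = (alpha, -(r - l))
            if best is None or key < best[0]:
                best = (key, l, r, alpha)
    return best[1], best[2], best[3]

l, r, alpha = most_stringent(n0=3, M=250, alternatives=[1, 2, 4, 5, 6])
```

Under these assumptions the search returns the region $3 \le m \le 12$ with $\alpha$ close to .2524, in agreement with the result stated for Table III.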


* This condition is introduced for the sake of uniqueness of the solution, as there may be several 
regions of acceptance satisfying the condition (1). 




TABLE IV. Illustrating the process of determining the most stringent test.
Hypothesis tested: $n_0 = 2$. Size of sample: $M = 250$.


The most stringent test corresponds to the region of acceptance
$9 \le m \le 48$, $\alpha = .0269$.


5. Comparison of the most stringent test with the $\chi^2$ test and with the test derived from
the principle of likelihood.


When comparing the most stringent test with that derived from the principle
of likelihood it must first be pointed out that the purposes of these tests are
different. The purpose of the $\lambda$-test is (1) to make sure that when among the alternative
admissible hypotheses there are some ascribing to the facts observed a
probability much greater than that ascribed by the hypothesis tested, then the
latter will be rejected, and inversely, (2) to reduce the probability of an error of
the first kind to a level, $\varepsilon$, chosen in advance.

It will be noticed that these two conditions determine in each case the probability
of second-kind errors, but that the value of this probability is not
explicitly taken into account when the $\lambda$-test and the level $\varepsilon$ are actually fixed.

On the other hand, the purpose of the most stringent test is to control both 
kinds of errors equally, so as to reduce the largest of their probabilities to as low 
a level as possible. 

Consequently we may use the $\lambda$-test with varying levels of significance, i.e.
choosing different values of $\varepsilon$. In each case, $\eta$, or the greatest of the probabilities
of second-kind errors, will have a different value, which may be smaller than,
equal to, or greater than the value of $\varepsilon$ chosen. On the other hand, by the most
stringent test the values of $\varepsilon$ and $\eta$ are fixed and it is impossible to alter either of
these without depriving the test of the property of being the most stringent one.
The greater of the numbers $\varepsilon$ and $\eta$ is denoted by $\alpha$, representing the upper bound
of the probability of an error (of any kind) when applying the most stringent test.
The characteristic property of this test is that the number $\alpha$ thus defined, calculated
for any other test, is larger than for the most stringent test.

Having this in view, it is easily seen that a further comparison between the
two tests described reduces to the question of whether, by a suitable choice of $\varepsilon$,
the $\lambda$-test may be made identical with the most stringent test. This question is
answered in the negative, and the answer is obtained by consulting Table I
giving all possible values of $\varepsilon$, the $\eta$'s and $\alpha$ corresponding to the $\lambda$-test, as applied
to the hypothesis $n_0 = 3$ with $M = 250$. It is seen that the smallest value of $\alpha = .2634$
corresponds to both $\lambda_0 = .850$ and $\lambda_0 = .488$, the region of acceptance being
$2 \le m \le 8$ and $2 \le m \le 9$, respectively. On the other hand, as it was found in
Table III, the region of acceptance corresponding to the most stringent test is
$3 \le m \le 12$ with $\alpha = .2524$. The difference in stringency is not large, but the fact
that it is not zero shows that in general the most stringent test could not be
obtained from the $\lambda$-test merely by choosing properly $\lambda_0$ or $\varepsilon$.

The $\chi^2$ test has been used (4) to test hypotheses of the kind considered in the
present paper, and it seems useful to consider the consequences of its application 
and to compare them with those corresponding to the most stringent test. 

As is well known, in the present case

$$\chi^2 = \frac{(m - Mp_0)^2}{Mp_0(1 - p_0)}, \tag{31}$$

where $p_0 = 4^{-n_0}$.* The test consists in rejecting the hypothesis tested when the
value of $\chi^2$ is “too large”, exceeding certain limits, such as $\chi^2_{.05}$, $\chi^2_{.01}$, etc.,
where generally $\chi_\varepsilon^2$ means a number such that

$$\sqrt{\frac{2}{\pi}} \int_{\chi_\varepsilon}^{\infty} e^{-\frac{1}{2}\chi^2}\, d\chi = \varepsilon. \tag{32}$$

The $\varepsilon$ in this equation is not exactly equal to the probability of a wrong rejection
of the hypothesis tested when it is true, but the inaccuracy is not very important.
Therefore, choosing $\varepsilon = .02$ or $\varepsilon = .01$ we may be certain that we shall not commit
errors of the first kind too frequently. But now let us consider the probabilities
of accepting the hypothesis tested when some alternative hypothesis is true.
This is most conveniently done by considering Fig. 3.

It gives, as open circles joined by dotted lines, the graph of $\chi^2$ considered as a
function of $m$, for $n_0 = 3$, $M = 250$. The horizontal lines correspond to four different
levels of significance $\varepsilon = .10$, .05, .02, and .01, and the points of their intersection
with the parabola representing the graph of $\chi^2$ mark the regions of acceptance.
Thus, for instance, if we choose to reject the hypothesis tested (that $n = n_0 = 3$)
whenever $\chi^2 > \chi^2_{.10}$ this will be equivalent to rejecting the hypothesis when
either $m < 1$ or $m > 7$, so that the region of acceptance is $1 \le m \le 7$. This is found by
noticing that the abscissae of points of intersection of the horizontal straight line
marked $\chi^2_{.10}$ with the graph of $\chi^2$ are between 0 and 1 and between 7 and 8,
respectively.

It is seen at once that this region of acceptance, $1 \le m \le 7$, will not be very
satisfactory from the point of view of controlling errors of the second kind. In
fact the graph of $p(m \mid n = 4)$, representing the probability law of $m$ when $n = 4$,
shows that in this case $m$ will fall within the region of acceptance more frequently
than not. But it is not customary to apply the $\chi^2$ test at a level of $\varepsilon = .10$. Of course
there is no definite rule concerning this point, but I think it will be true to say
that in the great majority of cases when the $\chi^2$ test is applied in practice, the
hypothesis tested is considered only as doubtful when $\chi^2 > \chi^2_{.05}$ and it is definitely
rejected only if $\chi^2 > \chi^2_{.02}$, or perhaps when $\chi^2 > \chi^2_{.01}$.

[Fig. 3. Graph of $\chi^2$ as a function of $m$, for $n_0 = 3$, $M = 250$, plotted against the scale of $m$.]

* If we were dealing with the back-cross generation, then it would be $p_0 = 2^{-n_0}$.

Examining Fig. 3 it is seen that the region of acceptance corresponding to the
level of significance $\varepsilon = .05$ is identical with that corresponding to $\varepsilon = .10$. This is
of course due to the fact that the probability law of $\chi$ only approximately represents
that of $m$. As to the regions of acceptance corresponding to $\varepsilon = .02$ and $\varepsilon = .01$,
they are broader than that corresponding to $\varepsilon = .05$ and, since $m$ cannot assume
negative values, are again both identical, determined by the inequalities $0 \le m \le 8$.
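The four regions of acceptance read off from Fig. 3 can be recomputed from (31) and (32). In the sketch below $\chi_\varepsilon$ is obtained as the two-sided normal deviate, which is what (32) defines for one degree of freedom; the function name `chi_square_region` is illustrative.

```python
import math
from statistics import NormalDist

def chi_square_region(n0: int, M: int, eps: float):
    """Integers m with (m - M*p0)**2 / (M*p0*(1 - p0)) <= chi_eps**2, as (l, r)."""
    p0 = 4.0 ** (-n0)
    chi = NormalDist().inv_cdf(1.0 - eps / 2.0)   # chi_eps of (32)
    half_width = chi * math.sqrt(M * p0 * (1.0 - p0))
    l = max(0, math.ceil(M * p0 - half_width))    # m cannot be negative
    r = min(M, math.floor(M * p0 + half_width))
    return l, r

regions = {eps: chi_square_region(3, 250, eps) for eps in (0.10, 0.05, 0.02, 0.01)}
```

For $n_0 = 3$ and $M = 250$ this gives $1 \le m \le 7$ at the levels .10 and .05, and $0 \le m \le 8$ at the levels .02 and .01.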

It is important to understand clearly the significance of these inequalities. If
we decide to reject the hypothesis tested that $n = n_0 = 3$ only in cases when
$\chi^2 > \chi^2_{.02}$ or $\chi^2 > \chi^2_{.01}$, then as a result we shall never reject it in cases where the
number of pure recessives in the $F_2$ generation has a value between 0 and 8 inclusive.
Looking again at Fig. 3 we see that if the true value of $n$ happens to be 4
(or more) then, for $M = 250$, the number of recessives in the $F_2$ generation is
practically certain to be less than 8. It follows that whenever the true value of
$n \ge 4$ and we test the hypothesis that $n = n_0 = 3$, applying the $\chi^2$ test at the level
of .02 or .01, then the hypothesis tested will be never rejected in spite of its being
false! This is, of course, most unsatisfactory, remembering that when the most
stringent test is used errors of either kind in testing the hypothesis considered
would be avoided on the average in more than 74 per cent. of cases.

It is easy to see what are the circumstances in which the $\chi^2$ test fails to control
the errors of the second kind as described above. Solving equation (31) with
regard to $m$, we obtain

$$m = Mp_0 \pm \chi \sqrt{Mp_0(1 - p_0)}, \tag{33}$$

from which it follows that the left limit of the region of acceptance, say $l_\varepsilon$,
corresponding to any level of significance, $\varepsilon$, is the smallest non-negative integer
such that

$$l_\varepsilon \ge Mp_0 - \chi_\varepsilon \sqrt{Mp_0(1 - p_0)}. \tag{34}$$

Consequently $l_\varepsilon = 0$ whenever

$$Mp_0 - \chi_\varepsilon \sqrt{Mp_0(1 - p_0)} \le 0, \tag{35}$$

or, solving the inequality with regard to $M$,

$$M \le \chi_\varepsilon^2\, \frac{1 - p_0}{p_0}. \tag{36}$$


It follows that the insensitiveness described of the $\chi^2$ test occurs when M is not
large enough. It is seen also that the efficacy of the $\chi^2$ test is improved when we
use the back-cross generation instead of $F_2$.
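
The bound on M may be evaluated for the two kinds of data. The sketch below assumes that inequality (36) reads $M \leq \chi_\varepsilon^2 (1-p_0)/p_0$, that $p_0 = 4^{-n_0}$ for the $F_2$ generation and $p_0 = 2^{-n_0}$ for the back-cross, and that $\chi_\varepsilon$ is the two-sided normal deviate. The function name is hypothetical.

```python
from statistics import NormalDist


def max_insensitive_M(p0, eps):
    """Right-hand side of (36) as assumed here: chi_eps**2 * (1 - p0) / p0.

    For samples no larger than this value the left limit of the
    chi-square region of acceptance is zero.
    """
    chi = NormalDist().inv_cdf(1 - eps / 2)
    return chi * chi * (1 - p0) / p0


n0 = 3
for eps in (0.05, 0.02, 0.01):
    bound_f2 = max_insensitive_M(4.0 ** -n0, eps)  # F2: p0 = 4**(-n0)
    bound_bc = max_insensitive_M(2.0 ** -n0, eps)  # back-cross: p0 = 2**(-n0)
    print(f"eps = {eps}: F2 bound = {bound_f2:.1f}, back-cross bound = {bound_bc:.1f}")
```

Since $p_0$ is larger for the back-cross, the bound is much smaller there, which is one way of seeing why the back-cross data serve the $\chi^2$ test better.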

It will now be interesting to see what may happen in cases when M is larger
than the right-hand side of the inequality (36), so that $\chi^2$ does give some control
of the errors of the second kind. Table V gives the regions of acceptance and the
probabilities of errors corresponding to the two tests considered when applied
to the hypothesis that $n = n_0 = 2$ (when M = 250).

TABLE V. Regions of acceptance and probabilities of errors corresponding to the $\chi^2$
and the most stringent test when applied to the hypothesis that $n = n_0 = 2$ (M = 250).

[Body of table not legible in this copy. Column headings: Test; Region of
acceptance; Probability of errors. Rows: $\chi^2$ test; most stringent test.]

Examining this table we see that while the $\chi^2$ test does control both kinds of
errors it is still unsatisfactory. If it is applied at the level of significance .05, the
actual value of $\varepsilon$ is .0561 and, by applying the most stringent test, it could be
reduced by more than half without undue increase in the probability of an error
when the true value of n happens to be either 1 or 3. On the other hand, if the $\chi^2$
test is applied at the level .01, the probability of a wrong rejection of the
hypothesis tested is reduced to the level .0118, but this is attained at a heavy price.
For there is now a danger of accepting the hypothesis tested (that $n_0 = 2$) in over
20 per cent. of cases when the true n = 3.

Considering that the application of the most stringent test guarantees that the
probability of any kind of errors could not exceed the limit .0269, we must
conclude that the $\chi^2$ test can hardly be considered as satisfactory.


6. Some data facilitating the use of the most stringent test in the genetical problems 
considered. 

To use the most stringent test in practice we require

(1) the limits, $l$ and $r$, of the region of acceptance, and

(2) the value of the stringency, $\beta$, corresponding to the particular hypothesis
tested, to the size of the sample, M, and to the kind of experimental data
available, which may be either $F_2$ or $F_2'$ (back-cross) generations.

Owing to the particular character of the probability law p(m|n) it was im- 
possible to give a solution of the above problems in simple theoretical form, and 
it is here presented in the form of tables and charts from which the necessary data 
can be readily obtained with sufficient accuracy. 

If the size of the sample, M, is comparatively small, the limits of the region of
acceptance, $l$ and $r$, are changing irregularly. Since also an error in these limits,
even if only equal to a unit, has a considerable effect on the stringency of the test,
it seemed useful to give exact values of $l$ and $r$ for a number of small values of M.

Table VI gives such values for $10 \leq M \leq 50$, for the two kinds of experimental
data and for the most important hypotheses likely to be tested, namely $n_0 = 1$,
$n_0 = 2$, $n_0 = 3$. This table was calculated using the Incomplete Beta Function
Tables (5).

The columns for $F_2$ generation and for $n_0 = 3$ show a remarkable anomaly. This
is a result of the obvious fact that in the circumstances very small samples
could not discriminate between the values of n with any sort of accuracy.


TABLE VI. Regions of acceptance corresponding to the most stringent test for
different hypotheses and different values of M. Hypotheses: $n_0 = 1, 2, 3$; $F_2$
and $F_2'$ generation.

[Body of table not legible in this copy. Column headings: $F_2$ generation and
back-cross generation, each under $n_0 = 1$*, $n_0 = 2$, $n_0 = 3$.]

* For $n_0 = 1$, $r$ is always equal to M.


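Beyond the range of this table, i.e. for M > 50, the changes of $l$ and $r$ are not so
irregular. A series of trials showed that while $l$ and $r$ increase with M increasing,
the ratios $l/M$ and $r/M$ tend rapidly to certain fixed limits, say $K_1$ and $K_2$. These
limits are easy to calculate if we assume that for large values of M the part of
the probability of rejecting the hypothesis tested, when true, which arises from
m > $r$, say $\varepsilon_2$, is negligible, and that the part arising from m < $l$, say $\varepsilon_1$, satisfies

$$\varepsilon_1 = \eta_{-1} = \eta_1. \tag{37}$$

Since for large values of M, $\varepsilon_1$, $\eta_{-1}$ and $\eta_1$ are all represented with increasing
accuracy by normal integrals, we may write the following approximate equations:

$$\eta_1 = \frac{1}{\sqrt{2\pi}}\int_{a}^{\infty} e^{-x^2/2}\,dx \quad \left(\text{where } a = \frac{l - Mp_1}{\sqrt{Mp_1q_1}}\right), \tag{38}$$

$$\varepsilon_1 = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b} e^{-x^2/2}\,dx \quad \left(\text{where } b = \frac{l - Mp_0}{\sqrt{Mp_0q_0}}\right), \tag{39}$$

$$\eta_{-1} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{c} e^{-x^2/2}\,dx \quad \left(\text{where } c = \frac{r - Mp_{-1}}{\sqrt{Mp_{-1}q_{-1}}}\right), \tag{40}$$

where, in the case when the experimental data concern the $F_2$ generation,
$p_0 = 1 - q_0 = 4^{-n_0}$, $p_{-1} = 1 - q_{-1} = 4^{1-n_0}$, $p_1 = 1 - q_1 = 4^{-1-n_0}$, and when we are
dealing with the back-cross generation, $p_0 = 1 - q_0 = 2^{-n_0}$, $p_{-1} = 1 - q_{-1} = 2^{1-n_0}$,
$p_1 = 1 - q_1 = 2^{-1-n_0}$.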

Substituting the integrals (38), (39) and (40) in (37) we easily obtain two
equations in $l$ and $r$ of the form

$$\frac{l}{M} = \frac{p_1\sqrt{p_0q_0} + p_0\sqrt{p_1q_1}}{\sqrt{p_0q_0} + \sqrt{p_1q_1}} = K_1 \text{ (say)}, \tag{41}$$

$$\frac{r}{M} = p_{-1} - \frac{p_0 - K_1}{\sqrt{p_0q_0}}\,\sqrt{p_{-1}q_{-1}} = K_2 \text{ (say)}, \tag{42}$$

which give the limiting values of the ratios $l/M$ and $r/M$ for $M \to \infty$. Multiplying
$K_1$ and $K_2$ by the size of any given sample, M, we obtain the approximate values
of corresponding $l$ and $r$. However, the approximation is found to be better if
instead of taking $l = K_1M$ and $r = K_2M$ we take for the values of $l$ and $r$ the
integers nearest to the expressions

$$M\lambda = MK_1 + \tfrac{1}{2} \quad \text{and} \quad M\rho = MK_2 - \tfrac{1}{2}. \tag{43}$$


Fig. 4 shows the graphs of the correct ratios $l/M$ and $r/M$ and those of $\lambda$ and $\rho$ all
considered as functions of M.

It is seen that for values of M over, say, M = 40 we shall hardly ever make an
error in calculating $l$ and $r$ from the formula (43). Fig. 4 applies to the case when
$n_0 = 2$ and the experimental data concern the back-cross generation, $F_2'$. In other
cases the position is similar, but with a disadvantage for the $F_2$ generation.
However, if the value of M is considerable, even if we do make an error of unity
in the values of $l$ and $r$, this will influence the value of the stringency of the test
but very slightly. Table VII gives the values of $K_1$ and $K_2$ for $n_0 = 1, 2, 3, 4$ and
for the two kinds of experimental data.

Tables VI and VII give a practical solution to the problem of determining
the regions of acceptance, corresponding to the most stringent test, in any
particular case. The method may be summarized as follows:

(a) If the size of the sample $M \leq 50$, use Table VI and reject the hypothesis
tested whenever the number of recessives, m, is either less than $l$ or greater than $r$
as given in this table.

(b) If the size of the sample M > 50, then use the values of $K_1$ and $K_2$ as given
in Table VII and calculate $M\lambda$ and $M\rho$ from formula (43). Reject the hypothesis
tested when the number of pure recessives, m, is either less than the integer
nearest to $M\lambda$ or greater than the integer nearest to $M\rho$.


[Fig. 4. Graphs of $l/M$, $r/M$, $\lambda$ and $\rho$ as functions of M. Horizontal axis: size of sample (M).]


TABLE VII. Limiting values of $K_1$ and $K_2$ for certain hypotheses.

          $F_2$ generation        Back-cross ($F_2'$) generation
$n_0$     $K_1$       $K_2$       $K_1$       $K_2$
1         ...         ...         .28868      1
2         ...         ...         .17913      .41817
3         .00783      .04728      .08891      .20275
4         ...         ...         .04432      .10016


Example. Assume that the hypothesis tested is $n_0 = 3$, that the size of the
sample of $F_2$ generation is M = 549, and that the number of pure recessives in
that generation is m = 2 (7).

From Table VII we find $K_1 = .00783$, $K_2 = .04728$. Substituting these values
in formula (43) we have

$M\lambda = 549 \times .00783 + .5$,

$M\rho = 549 \times .04728 - .5$.

The integers nearest to $M\lambda$ and $M\rho$ are 5 and 25 respectively, and it follows that
the hypothesis tested, that $n_0 = 3$, should be rejected in the present case. It may
be useful to notice that the rejection occurs in favour of the alternative hypothesis
that n = 4.
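
The arithmetic of this Example may be retraced as follows. This is a sketch under stated assumptions: (41) and (42) are read as $K_1 = (p_1\sqrt{p_0q_0} + p_0\sqrt{p_1q_1})/(\sqrt{p_0q_0} + \sqrt{p_1q_1})$ and $K_2 = p_{-1} - (p_0 - K_1)\sqrt{p_{-1}q_{-1}}/\sqrt{p_0q_0}$, a reading which agrees with the values of $K_1$ and $K_2$ quoted in the Example; the function names are illustrative, and the case $n_0 = 1$ is deliberately left out.

```python
from math import floor, sqrt


def limiting_ratios(n0, base):
    """K1 and K2 as assumed above; base = 4 for F2 data, 2 for back-cross data."""
    if n0 < 2:
        raise ValueError("this sketch covers n0 >= 2 only")
    p0 = base ** -n0          # hypothesis tested, n = n0
    p1 = base ** (-n0 - 1)    # alternative n = n0 + 1
    pm1 = base ** (1 - n0)    # alternative n = n0 - 1
    s0 = sqrt(p0 * (1 - p0))
    s1 = sqrt(p1 * (1 - p1))
    sm1 = sqrt(pm1 * (1 - pm1))
    k1 = (p1 * s0 + p0 * s1) / (s0 + s1)
    k2 = pm1 - (p0 - k1) / s0 * sm1
    return k1, k2


def nearest_integer(x):
    """Nearest integer, halves rounded upwards."""
    return floor(x + 0.5)


def acceptance_limits(M, k1, k2):
    """Integers nearest to M*K1 + 1/2 and M*K2 - 1/2, as in formula (43)."""
    return nearest_integer(M * k1 + 0.5), nearest_integer(M * k2 - 0.5)


M, m, n0 = 549, 2, 3
k1, k2 = limiting_ratios(n0, base=4)          # F2 generation
left, right = acceptance_limits(M, k1, k2)
reject = m < left or m > right
print(f"K1 = {k1:.5f}, K2 = {k2:.5f}, l = {left}, r = {right}, reject = {reject}")
```

With m = 2 falling below the left limit, the rejection points towards a larger number of factors, as remarked above.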


[Fig. 5: two panels, "Back-cross generation, hypotheses $n_0 = 1, 2, 3$" and "$F_2$ generation, hypotheses $n_0 = 1, 2, 3$"; vertical axis: stringency, $\beta$; horizontal axis: size of sample (M).]

Fig. 5. Graphs representing the dependence of stringency on size of sample
for different hypotheses.


Besides the limits of the region of acceptance, the practical application of the
most stringent test requires the knowledge of the stringency of the test
corresponding to any given case. This is particularly useful and should be taken into
account when planning an experiment which will provide the data for testing a
particular hypothesis concerning the value of n. The questions which should be
asked here are:

(1) What kind of data is preferable; those concerning the $F_2$ generation or
the back-cross?

(2) What should be the size of the sample, M, to obtain a reasonably reliable
verdict concerning the hypothesis tested?


[Fig. 6: two panels, the first headed "$F_2$ generation"; vertical axis: stringency, $\beta$; horizontal axis: size of sample (M).]

Fig. 6. Graphs representing the dependence of stringency on size of sample
for different hypotheses.


These two questions are answered by using the graphs showing the stringency
of the most stringent test, considered as a function of M, given in Figs. 5 and 6.
For $M \leq 50$ these values of $\beta$ were calculated using the Incomplete Beta Function
Tables (5), and therefore are accurate. For large values of M, which fall beyond
the range of the Tables, the values of $\beta$ were calculated only approximately,
partly using the normal integral and partly the Incomplete Gamma Function
Tables (6) giving the probability integral of the Poisson Series.

It is important to understand the meaning of the information contained in
Figs. 5 and 6. When $n_0$, M, and the kind of the generation, $F_2$ or $F_2'$, are fixed,
then the probability, say P, of avoiding an error in testing the hypothesis that
$n = n_0$ is not fixed, but may vary according to the value of n which happens to be
true and according to the particular test employed. The value of $\beta$ as given in
Figs. 5 and 6 is the lower limit of P in the case when the test employed is the most
stringent test. Therefore if we use the most stringent test the probability of a
correct judgment concerning the hypothesis tested is never less than $\beta$, though it
may be equal to or even larger than $\beta$. On the other hand, if we use any other test,
different from the most stringent test, then the lower limit of the probability of
avoiding an error in the verdict will be smaller than $\beta$. In other words, $\beta$ is the
lower limit of the probability of a correct verdict concerning the hypothesis
tested, corresponding to the test for which it has the greatest possible value.

We may now consider Figs. 5 and 6. They present a somewhat unexpected
feature in that the stringency, $\beta$, is not a monotonically increasing function of the
size of the sample M. Intuitively we would expect that if $M_1 < M_2$, then the
verdict on the hypothesis tested based on a sample of $M_2$ must be always more
reliable than that based on a somewhat smaller sample of $M_1$. However strange
it may seem, this is not always the case and certain fluctuations in the inverse
direction do occur. After some consideration the cause of these fluctuations is
easily understood to be due to the fact that m, and also the limits of the region of
acceptance $l$ and $r$, may take only integer values. M may increase, but for some
time $l$ and $r$ remain unchanged. During this period $\beta$ may increase and then
decrease until it becomes possible to improve its value by changing $l$ or $r$, or both.
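
The meaning of $\beta$ and the fluctuations just described may be illustrated by a direct search over small samples. What follows is a sketch under assumptions, not a restatement of the formal definition given earlier in the paper: the admissible hypotheses are taken as n = 1, 2, ... (cut off at a hypothetical `n_max`), m is taken as binomial with probability $2^{-n}$ (back-cross) or $4^{-n}$ ($F_2$) of a pure recessive, regions of acceptance are taken as intervals $l \leq m \leq r$, and the most stringent test is taken as the interval for which the smallest probability of a correct verdict is greatest.

```python
from math import comb


def cumulative(M, p):
    """c[k] = P(m < k) for k = 0, ..., M + 1 under the binomial law."""
    c = [0.0]
    for m in range(M + 1):
        c.append(c[-1] + comb(M, m) * p**m * (1 - p) ** (M - m))
    return c


def most_stringent(M, n0, base, n_max=8):
    """Search all intervals l <= m <= r; return (beta, l, r).

    base = 2 for back-cross data, 4 for F2 data; n_max must exceed n0.
    beta is the smallest probability of a correct verdict over n = 1..n_max.
    """
    cdfs = {n: cumulative(M, float(base) ** -n) for n in range(1, n_max + 1)}
    best = (-1.0, None, None)
    for l in range(M + 1):
        for r in range(l, M + 1):
            worst = 1.0
            for n, c in cdfs.items():
                accept = c[r + 1] - c[l]
                correct = accept if n == n0 else 1.0 - accept
                worst = min(worst, correct)
            if worst > best[0]:
                best = (worst, l, r)
    return best


beta, l, r = most_stringent(M=50, n0=3, base=2)
print(f"back-cross, n0 = 3, M = 50: l = {l}, r = {r}, beta = {beta:.3f}")

# Stringency for a few very small samples (n0 = 1, back-cross).
for M in range(1, 7):
    print(M, round(most_stringent(M, n0=1, base=2)[0], 4))
```

Under these assumptions the search for $n_0 = 3$, M = 50 on the back-cross gives a stringency of about .72, and the short series for $n_0 = 1$ is not monotonic in M.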

Looking at the diagrams we see that for a fixed value of M, the verdict on the
hypothesis tested based on the back-cross generation is always more reliable
than that based on $F_2$. For instance, if $n_0 = 3$ and M = 50, the probability of a
correct verdict based on the most stringent test applied to the back-cross
generation could not be smaller than $\beta = .72$. On the other hand, if we deal with the $F_2$
generation, whatever the test applied, the probability of a correct statement
concerning the hypothesis tested may be as small as $\beta = .48$.

This is the answer to question (1) of p. 159 above. The geneticists seem to be
aware of the advantages of the back-cross generation data, but I am not clear
what reasons have been advanced in support of this belief.

Considering Figs. 5 and 6 from the point of view of question (2), we see that the
larger the hypothetical value of $n = n_0$, the larger must be the sample to give the
same value of the stringency $\beta$. For what value of $\beta$ the test may be considered
reliable is, of course, a personal question. However, it seems to be certain that
no one would consider a test as reliable if $\beta$ is less than $\tfrac{1}{2}$. In fact, a frequent
application of the test in such a case may result, in unfavourable conditions, in
more than 50 per cent. of erroneous conclusions! If the stringency is below $\tfrac{1}{2}$,
then the most stringent test should not be used, for if it is impossible to control
both kinds of errors to a suitable extent we may at least control one of them.

Whatever value of $\beta$ is fixed as determining what may be considered as a
reliable test, the charts give the means of fixing the size of the sample at which
we have to aim when arranging the experiment. For example, if the hypothesis
$n_0 = 2$ is tested on a back-cross generation, and if it is desired to have $\beta > .9$, we
have to aim at the size of sample M > 73.


7. Summary of results. 


The theory of testing statistical hypotheses, when the set of admissible hypo- 
theses is discontinuous, has been discussed using a simple genetical problem as an 
illustration. The principle of likelihood (Neyman and Pearson) has been used to 
derive a test, but this test proved to be unsatisfactory. A test, which has been 
called the most stringent test, has been developed based on the consideration of 
the total probability of errors of any kind involved in the testing of statistical 
hypotheses. Numerical comparisons of the results obtained from applying the 
most stringent test, the $\chi^2$ test, which is now in general use, and the test derived
from the principle of likelihood, have been made and the advantages of the most 
stringent test demonstrated. 

Tables and diagrams have been prepared to facilitate the use of the most
stringent test in practice. The methods of using these tables

(1) to determine the size of sample necessary to test a particular hypothesis at
a desired level of control of the total probability of error, and

(2) to determine the regions of acceptance and total probability of errors
involved in testing a particular hypothesis by means of a given sample,

have been discussed and illustrated by the use of examples.


In conclusion I wish to thank Dr J. Neyman for suggesting the problem, and 
for the great amount of help which he has given me in the course of the work. 


8. REFERENCES 


(1) J. Neyman and E. S. Pearson (1928). “On the use and interpretation of certain test criteria
for purposes of statistical inference”, Biometrika, vol. XXA, pp. 175-240 and 264-294.

(2) J. Neyman and E. S. Pearson (1933). “On the problem of the most efficient tests of statistical
hypotheses”, Phil. Trans. Roy. Soc. A, vol. CCXXXI, pp. 289-337.

(3) J. Neyman and E. S. Pearson (1933). “The testing of statistical hypotheses in relation to
probabilities a priori”, Proc. Camb. Phil. Soc. vol. XXIX, part 4, pp. 492-510.

(4) R. A. Fisher (1935). Statistical Methods for Research Workers, § 20.

(5) K. Pearson (1934). Tables of the Incomplete Beta Function.

(6) K. Pearson (1922). Tables of the Incomplete Gamma Function.

(7) Hajime Matsuura (1931). Genic analysis in Avena. A Monograph, Sapporo, Japan.


CAMBRIDGE: PRINTED BY 
WALTER LEWIS, M.A. 
AT THE UNIVERSITY PRESS 



Tests of statistical hypotheses in the case when the set of
alternatives is discontinuous, illustrated on some genetical problems.
By Robert W. B. Jackson

... issued about once a year. The Memoirs may be ...

THE DEPARTMENT OF STATISTICS
UNIVERSITY COLLEGE
GOWER STREET, LONDON, W.C. 1
